Abstract

When using deep, multi-layered architectures to build generative
models of data, it is difficult to train all layers at once. We propose
a layer-wise training procedure admitting a performance guarantee
compared to the global optimum. It is based on an optimistic proxy
of future performance, the best latent marginal. We interpret
auto-encoders in this setting as generative models, by showing that they
train a lower bound of this criterion. We test the new learning procedure
against a state-of-the-art method (stacked RBMs), and find it to improve
performance. Both theory and experiments highlight the importance,
when training deep architectures, of using an inference model (from
data to hidden variables) richer than the generative model (from hidden
variables to data).

Introduction

Deep architectures, such as multiple-layer neural networks, have recently been 
the object of a lot of interest and have been shown to provide state-of-the-art 
performance on many problems [ ]. A key aspect of deep learning is to
help in learning better representations of the data, thus reducing the need 
for hand-crafted features, a very time-consuming process requiring expert 
knowledge. 

Due to the difficulty of training a whole deep network at once, a so-called
layer-wise procedure is used as an approximation [12, 1]. However, a
long-standing issue is the justification of this layer-wise training: although
the method has shown its merits in practice, theoretical justifications fall
somewhat short of expectations. A frequently cited result [12] is a proof that
adding layers increases a so-called variational lower bound on the log-likelihood
of the model, and therefore that adding layers can improve performance.

We reflect on the validity of layer-wise training procedures, and discuss 
in what way and with what assumptions they can be construed as being 
equivalent to the non-layer-wise, that is, whole-network, training. This leads
us to a new approach for training deep generative models, using a new criterion 
for optimizing each layer starting from the bottom and for transferring the 
problem upwards to the next layer. Under the right conditions, this new 
layer-wise approach is equivalent to optimizing the log-likelihood of the full
deep generative model (Theorem 1). 

As a first step, in Section 1 we re-introduce the general form of deep 
generative models, and derive the gradient of the log-likelihood for deep 
models. This gradient is seldom considered because it is intractable:
it requires sampling from complex distributions. Hence the
need for a simpler, layer-wise training procedure.

We then show (Section 2.1) how an optimistic criterion, the BLM upper 
bound, can be used to train optimal lower layers provided subsequent training 
of upper layers is successful, and discuss what criterion to use to transfer the 
learning problem to the upper layers. 

This leads to a discussion of the relation of this procedure with stacked 
restricted Boltzmann machines (SRBMs) and auto-encoders (Sections 2.3 
and 2.4), in which a new justification is found for auto-encoders as optimizing 
the lower part of a deep generative model. 

In Section 2.7 we spell out the theoretical advantages of using a model for
the hidden variable $\mathbf{h}$ having the form
$Q(\mathbf{h}) = q(\mathbf{h}|\mathbf{x})P_{\mathrm{data}}(\mathbf{x})$ when looking
for hidden-variable generative models of the data $\mathbf{x}$, a scheme close to that
of auto-encoders.

Finally, we discuss new applications and perform experiments (Section 3) 
to validate the approach and compare it to state-of-the-art methods, on two 
new deep datasets, one synthetic and one real. In particular we introduce 
auto-encoders with rich inference (AERIes) which are auto-encoders modified 
according to this framework. 

Indeed both theory and experiments strongly suggest that, when using 
stacked auto-associators or similar deep architectures, the inference part 
(from data to latent variables) should use a much richer model than the 
generative part (from latent variables to data), in fact, as rich as possible. 
Using richer inference helps to find much better parameters for the same 
given generative model. 

1 Deep generative models 

Let us go back to the basic formulation of training a deep architecture 
as a traditional learning problem: optimizing the parameters of the whole 
architecture seen as a probabilistic generative model of the data. 

1.1 Deep models: probability decomposition 

The goal of generative learning is to estimate the parameters
$\theta = (\theta_1, \ldots, \theta_n)$ of a distribution $P_\theta(\mathbf{x})$ in order
to approximate a data distribution $P_{\mathcal{D}}(\mathbf{x})$ on
some observed variable $\mathbf{x}$.

The recent development of deep architectures [12, 1] has given importance 
to a particular case of latent variable models in which the distribution of x 
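can be decomposed as a sum over states of latent variables $\mathbf{h}$,

$$P_\theta(\mathbf{x}) = \sum_{\mathbf{h}} P_\theta(\mathbf{x}|\mathbf{h})\, P_\theta(\mathbf{h})$$

with separate parameters for the marginal probability of $\mathbf{h}$ and the conditional
probability of $\mathbf{x}$ given $\mathbf{h}$. Setting $I = \{1, 2, \ldots, k\}$ such that
$\theta_I$ is the set of parameters of $P(\mathbf{x}|\mathbf{h})$ and
$J = \{k+1, \ldots, n\}$ such that $\theta_J$ is the set of
parameters of $P(\mathbf{h})$, this rewrites as

$$P_\theta(\mathbf{x}) = \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, P_{\theta_J}(\mathbf{h}) \tag{1}$$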
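As a minimal sketch of decomposition (1), consider a toy model in which both $\mathbf{x}$ and $\mathbf{h}$ take finitely many values, so that the sum over $\mathbf{h}$ is a vector-matrix product. The tables `p_h` and `p_x_given_h` are hypothetical values chosen for illustration; they do not come from the experiments of this article.

```python
import numpy as np

# Hypothetical toy model: 3 latent states h, 4 observable states x.
p_h = np.array([0.5, 0.3, 0.2])        # P(h): role of the top parameters theta_J
p_x_given_h = np.array([               # P(x|h): role of theta_I, one row per h
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Equation (1): P(x) = sum_h P(x|h) P(h)
p_x = p_h @ p_x_given_h
print(p_x)
```

In a deep model the same operation is applied recursively, with `p_h` itself obtained from the layer above.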

In deep architectures, the same kind of decomposition is applied to h 
itself recursively, thus defining a layered model with several hidden layers 
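$\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \ldots, \mathbf{h}^{(k_{\max})}$, namely

$$P_\theta(\mathbf{x}) = \sum_{\mathbf{h}^{(1)}} P_{\theta_{I_0}}(\mathbf{x}|\mathbf{h}^{(1)})\, P_{\theta_{J_0}}(\mathbf{h}^{(1)}) \tag{2}$$

$$P_{\theta_{J_k}}(\mathbf{h}^{(k)}) = \sum_{\mathbf{h}^{(k+1)}} P_{\theta_{I_k}}(\mathbf{h}^{(k)}|\mathbf{h}^{(k+1)})\, P_{\theta_{J_{k+1}}}(\mathbf{h}^{(k+1)}), \qquad 1 \le k \le k_{\max} - 1 \tag{3}$$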

At any one time, we will only be interested in one step of this decomposi- 
tion. Thus for simplicity, we consider that the distribution of interest is on 
the observed variable x, with latent variable h. The results extend to the 
other layers of the decomposition by renaming variables. 

In Sections 2.3 and 2.4 we quickly present two frequently used deep 
architectures, stacked RBMs and auto-encoders, within this framework. 

1.2 Data log-likelihood 

The goal of the learning procedure, for a probabilistic generative model, is 
generally to maximize the log-likelihood of the data under the model, namely, 
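to find the value of the parameter $\theta^* = (\theta^*_I, \theta^*_J)$ achieving

$$\theta^* := \operatorname*{arg\,max}_{\theta}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log P_\theta(\mathbf{x})\right] \tag{4}$$

$$\phantom{\theta^* :}= \operatorname*{arg\,min}_{\theta}\; D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, P_\theta), \tag{5}$$

where $P_{\mathcal{D}}$ is the empirical data distribution, and $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the
Kullback-Leibler divergence. (For simplicity we assume this optimum is unique.)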
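The equivalence between (4) and (5) follows from the identity $\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}[\log P_\theta(\mathbf{x})] = -H(P_{\mathcal{D}}) - D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, P_\theta)$, where the entropy $H(P_{\mathcal{D}})$ does not depend on $\theta$. The sketch below checks this numerically on a small discrete example; the distributions and the helper names are hypothetical.

```python
import numpy as np

def expected_log_likelihood(p_data, p_model):
    """E_{x ~ p_data}[log p_model(x)] for discrete distributions."""
    return float(np.sum(p_data * np.log(p_model)))

def kl_divergence(p, q):
    """D_KL(p || q), with the convention 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_data = np.array([0.4, 0.3, 0.2, 0.1])       # hypothetical data distribution
model_a = np.array([0.25, 0.25, 0.25, 0.25])  # two hypothetical candidate models
model_b = np.array([0.40, 0.30, 0.15, 0.15])

entropy = -expected_log_likelihood(p_data, p_data)
ll_a, kl_a = expected_log_likelihood(p_data, model_a), kl_divergence(p_data, model_a)
ll_b, kl_b = expected_log_likelihood(p_data, model_b), kl_divergence(p_data, model_b)

# Log-likelihood and KL divergence differ only by the constant -entropy,
# so ranking models by (4) or by (5) gives the same answer.
print(ll_a, -entropy - kl_a)
print(ll_b, -entropy - kl_b)
```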

An obvious way to tackle this problem would be a gradient ascent over 
the full parameter $\theta$. However, this is impractical for deep architectures
(Section 1.3 below). 

It would be easier to be able to train deep architectures in a layer-wise 
fashion, by first training the parameters $\theta_I$ of the bottom layer, deriving a
new target distribution for the latent variables $\mathbf{h}$, and then training $\theta_J$ to
reproduce this target distribution on h, recursively over the layers, till one 
reaches the top layer on which, hopefully, a simple probabilistic generative 
model can be used. 

Indeed this is often done in practice, except that the objective (4) is 
replaced with a surrogate objective. For instance, for architectures made of 
stacked RBMs, at each level the likelihood of a single RBM is maximized, 
ignoring the fact that it is to be used as part of a deep architecture, and 
moreover often using a further approximation to the likelihood such as 
contrastive divergence [ ]. Under specific conditions (i.e., initializing the 
upper layer with an upside-down version of the current RBM), it can be 
shown that adding a layer improves a lower bound on performance [12]. 

We address in Section 2 the following questions: Is it possible to compute 
or estimate the optimal value of the parameters $\theta^*_I$ of the bottom layer,
without training the whole model? Is it possible to compare two values
of $\theta_I$ without training the whole model? The latter would be particularly
convenient for hyper-parameter selection, as it would allow comparing
lower-layer models before the upper layers are trained, thus significantly reducing
the size of the hyper-parameter search space from exponential to linear in 
the number of layers. 
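As a rough illustration of this reduction (the numbers are hypothetical): with $m$ candidate hyper-parameter settings per layer and $L$ layers, selecting all layers jointly requires examining every combination, whereas selecting each layer before training the next one requires examining $m$ candidates $L$ times.

```latex
\underbrace{m^{L}}_{\text{joint selection}}
\quad\text{versus}\quad
\underbrace{m \cdot L}_{\text{layer-wise selection}},
\qquad\text{e.g. } m = 10,\ L = 3:\quad 10^{3} = 1000 \ \text{ versus } \ 10 \cdot 3 = 30 .
```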

We propose a procedure aimed at reaching the global optimum $\theta^*$ in a
layer-wise fashion, based on an optimistic estimate of log-likelihood, the best
latent marginal (BLM) upper bound. We study its theoretical guarantees
in Section 2. In Section 3 we make an experimental comparison between 
stacked RBMs, auto-encoders modified according to this scheme, and vanilla 
auto-encoders, on two simple but deep datasets. 



1.3 Learning by gradient ascent for deep architectures 

Maximizing the likelihood of the data distribution $P_{\mathcal{D}}(\mathbf{x})$ under a model, or
equivalently minimizing the KL-divergence $D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, P_\theta)$, is usually done
with gradient ascent in the parameter space.

The derivative of the log-likelihood for a deep generative model can be 
written as: 

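$$\frac{\partial \log P_\theta(\mathbf{x})}{\partial \theta} = \frac{\sum_{\mathbf{h}} \frac{\partial}{\partial\theta}\left(P_{\theta_I}(\mathbf{x}|\mathbf{h})\, P_{\theta_J}(\mathbf{h})\right)}{P_\theta(\mathbf{x})} \tag{6}$$

$$= \sum_{\mathbf{h}} \frac{\partial \log P_{\theta_I}(\mathbf{x}|\mathbf{h})}{\partial\theta}\, P_\theta(\mathbf{h}|\mathbf{x}) + \sum_{\mathbf{h}} \frac{\partial \log P_{\theta_J}(\mathbf{h})}{\partial\theta}\, P_\theta(\mathbf{h}|\mathbf{x}) \tag{7}$$

by rewriting $P_\theta(\mathbf{h}) / P_\theta(\mathbf{x}) = P_\theta(\mathbf{h}|\mathbf{x}) / P_{\theta_I}(\mathbf{x}|\mathbf{h})$. The derivative w.r.t. a given
component $\theta_i$ of $\theta$ simplifies because $\theta_i$ is either a parameter of $P_{\theta_I}(\mathbf{x}|\mathbf{h})$
when $i \in I$, or a parameter of $P_{\theta_J}(\mathbf{h})$ when $i \in J$:

$$\forall i \in I, \quad \frac{\partial \log P_\theta(\mathbf{x})}{\partial \theta_i} = \sum_{\mathbf{h}} \frac{\partial \log P_{\theta_I}(\mathbf{x}|\mathbf{h})}{\partial\theta_i}\, P_{\theta_I,\theta_J}(\mathbf{h}|\mathbf{x}) \tag{8}$$

$$\forall i \in J, \quad \frac{\partial \log P_\theta(\mathbf{x})}{\partial \theta_i} = \sum_{\mathbf{h}} \frac{\partial \log P_{\theta_J}(\mathbf{h})}{\partial\theta_i}\, P_{\theta_I,\theta_J}(\mathbf{h}|\mathbf{x}) \tag{9}$$

Unfortunately, this gradient ascent procedure is generally intractable, because
it requires sampling from $P_{\theta_I,\theta_J}(\mathbf{h}|\mathbf{x})$ (where both the upper layer and lower
layer influence $\mathbf{h}$) to perform inference in the deep model.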
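The structure of this gradient can be checked on a toy model small enough for the posterior over $\mathbf{h}$ to be enumerated exactly. The sketch below is an illustration under stated assumptions, not the setting of the article: the top layer is assumed to be a softmax over logits `b` (playing the role of $\theta_J$), and $P(\mathbf{x}|\mathbf{h})$ is a fixed table. For this parametrization, averaging $\partial \log P(\mathbf{h}) / \partial b_k$ under the posterior gives $P(h = k \,|\, \mathbf{x}) - P(h = k)$, which is compared with a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_px(b, p_x_given_h, x):
    """log P(x) = log sum_h P(x|h) P(h), with P(h) = softmax(b)."""
    return np.log(softmax(b) @ p_x_given_h[:, x])

def grad_top_layer(b, p_x_given_h, x):
    """Posterior average of d log P(h) / d b, for the softmax top layer."""
    p_h = softmax(b)
    joint = p_h * p_x_given_h[:, x]
    posterior = joint / joint.sum()     # P(h|x): depends on both layers
    # d log P(h) / d b_k = 1[h = k] - P(k); its posterior mean is:
    return posterior - p_h

rng = np.random.default_rng(0)
b = rng.normal(size=3)                            # hypothetical top-layer logits
p_x_given_h = rng.dirichlet(np.ones(4), size=3)   # hypothetical table, rows = h
x = 2

analytic = grad_top_layer(b, p_x_given_h, x)
eps = 1e-6
numeric = np.array([
    (log_px(b + eps * e_k, p_x_given_h, x) - log_px(b - eps * e_k, p_x_given_h, x)) / (2 * eps)
    for e_k in np.eye(3)
])
print(analytic, numeric)
```

Here the posterior is obtained by summing over three latent states. In a deep architecture the corresponding sum runs over all joint configurations of the hidden layers, which is the source of the intractability mentioned above.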

2 Layer-wise deep learning 
2.1 A theoretical guarantee 

We now present a training procedure that works successively on each layer. 
First we train $\theta_I$ together with a conditional model $q(\mathbf{h}|\mathbf{x})$ for the latent
variable knowing the data. This step involves only the bottom part of
the model and is thus often tractable. This allows us to infer a new target
distribution for $\mathbf{h}$, on which the upper layers can then be trained.

This procedure singles out a particular setting $\hat\theta_I$ for the bottom layer of
a deep architecture, based on an optimistic assumption of what the upper 
layers may be able to do (cf. Proposition 3). 

Under this procedure, Theorem 1 states that it is possible to obtain a
validation that the parameter $\hat\theta_I$ for the bottom layer was optimal, provided
the rest of the training goes well. Namely, if the target distribution for $\mathbf{h}$
can be realized or well approximated by some value of the parameters $\theta_J$ of
the top layers, and if $\hat\theta_I$ was obtained using a rich enough conditional model
$q(\mathbf{h}|\mathbf{x})$, then $(\hat\theta_I, \theta_J)$ is guaranteed to be globally optimal.

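Theorem 1. Suppose the parameters $\theta_I$ of the bottom layer are trained by

$$(\hat\theta_I, \hat q) := \operatorname*{arg\,max}_{\theta_I,\, q}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, q_{\mathcal{D}}(\mathbf{h})\right] \tag{10}$$

where the arg max runs over all conditional probability distributions $q(\mathbf{h}|\mathbf{x})$
and where

$$q_{\mathcal{D}}(\mathbf{h}) := \sum_{\tilde{\mathbf{x}}} q(\mathbf{h}|\tilde{\mathbf{x}})\, P_{\mathcal{D}}(\tilde{\mathbf{x}}) \tag{11}$$

with $P_{\mathcal{D}}$ the observed data distribution.

We call the optimal $\hat\theta_I$ the best optimistic lower layer (BOLL). Let $\hat q_{\mathcal{D}}(\mathbf{h})$
be the distribution on $\mathbf{h}$ associated with the optimal $\hat q$. Then:

• If the top layers can be trained to reproduce $\hat q_{\mathcal{D}}(\mathbf{h})$ perfectly, i.e., if there
exists a parameter $\hat\theta_J$ for the top layers such that the distribution $P_{\hat\theta_J}(\mathbf{h})$
is equal to $\hat q_{\mathcal{D}}(\mathbf{h})$, then the parameters obtained are globally optimal:

$$(\hat\theta_I, \hat\theta_J) = (\theta^*_I, \theta^*_J)$$

• Whatever parameter value $\theta_J$ is used on the top layers in conjunction
with the BOLL $\hat\theta_I$, the difference in performance (4) between $(\hat\theta_I, \theta_J)$ and
the global optimum $(\theta^*_I, \theta^*_J)$ is at most the Kullback-Leibler divergence
$D_{\mathrm{KL}}(\hat q_{\mathcal{D}}(\mathbf{h}) \,\|\, P_{\theta_J}(\mathbf{h}))$ between $\hat q_{\mathcal{D}}(\mathbf{h})$ and $P_{\theta_J}(\mathbf{h})$.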

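This theorem strongly suggests using $\hat q_{\mathcal{D}}(\mathbf{h})$ as the target distribution for
the top layers, i.e., looking for the value $\hat\theta_J$ best approximating $\hat q_{\mathcal{D}}(\mathbf{h})$:

$$\hat\theta_J := \operatorname*{arg\,min}_{\theta_J}\; D_{\mathrm{KL}}(\hat q_{\mathcal{D}}(\mathbf{h}) \,\|\, P_{\theta_J}(\mathbf{h})) = \operatorname*{arg\,max}_{\theta_J}\; \mathbb{E}_{\mathbf{h}\sim \hat q_{\mathcal{D}}} \log P_{\theta_J}(\mathbf{h}) \tag{12}$$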

which thus takes the same form as the original problem. Then the same 
scheme may be used recursively to train the top layers. A final fine-tuning 
phase may be helpful, see Section 2.6. 

Note that when the top layers fail to approximate $\hat q_{\mathcal{D}}$ perfectly, the loss
of performance depends only on the observed difference between $\hat q_{\mathcal{D}}$ and $P_{\hat\theta_J}$,
and not on the unknown global optimum $(\theta^*_I, \theta^*_J)$. Beware that, unfortunately,
this bound relies on perfect layer-wise training of the bottom layer, i.e., on $\hat q$
being the optimum of the criterion (10) optimized over all possible conditional
distributions $q$; otherwise it is a priori not valid.

In practice the supremum on q will always be taken over a restricted 
set of conditional distributions $q(\mathbf{h}|\mathbf{x})$, rather than the set of all possible
distributions on h for each x. Thus, this theorem is an idealized version of 
practice (though Remark 4 below mitigates this). This still suggests a clear 
strategy to separate the deep optimization problem into two subproblems to 
be solved sequentially: 

1. Train the parameters $\theta_I$ of the bottom layer after (10), using a model
$q(\mathbf{h}|\mathbf{x})$ as wide as possible, to approximate the BOLL $\hat\theta_I$.

2. Infer the corresponding distribution of h by (11) and train the upper 
part of the model as best as possible to approximate this distribution. 

Then, provided learning is successful in both instances, the result is close 
to optimal. 

Auto-encoders can be shown to implement an approximation of this 
procedure, in which only the terms $\tilde{\mathbf{x}} = \mathbf{x}$ are kept in (10)-(11) (Section 2.4).

This scheme is designed with a situation in mind in which the upper layers
get progressively simpler. Indeed, if the layer for $\mathbf{h}$ is as wide as the layer for
$\mathbf{x}$ and if $P(\mathbf{x}|\mathbf{h})$ can learn the identity, then the procedure in Theorem 1 just
transfers the problem unchanged one layer up.

This theorem strongly suggests decoupling the inference and generative
models $q(\mathbf{h}|\mathbf{x})$ and $P(\mathbf{x}|\mathbf{h})$, and using a rich conditional model $q(\mathbf{h}|\mathbf{x})$,
contrary, e.g., to common practice in auto-encoders^1. Indeed the experiments of
Section 3 confirm that using a more expressive $q(\mathbf{h}|\mathbf{x})$ yields improved values
of $\theta$.

Importantly, $q(\mathbf{h}|\mathbf{x})$ is only used as an auxiliary prop for solving the
optimization problem (4) over $\theta$ and is not part of the final generative model,
so that using a richer $q(\mathbf{h}|\mathbf{x})$ to reach a better value of $\theta$ is not simply changing
to a larger model. Thus, using a richer inference model $q(\mathbf{h}|\mathbf{x})$ should not
pose too much risk of overfitting because the regularization properties of the
model come mainly from the choice of the generative model family ($\theta$).

The criterion proposed in (10) is of particular relevance to representation 
learning where the goal is not to learn a generative model, but to learn a 
useful representation of the data. In this setting, training an upper layer 
model $P(\mathbf{h})$ becomes irrelevant because we are not interested in the generative
model itself. What matters in representation learning is that the lower layer
(i.e., $P(\mathbf{x}|\mathbf{h})$ and $q(\mathbf{h}|\mathbf{x})$) is optimal for some model of $P(\mathbf{h})$, left unspecified.

We now proceed, by steps, to the proof of Theorem 1. This will be the 
occasion to introduce some concepts used later in the experimental setting. 



2.2 The Best Latent Marginal Upper Bound 

One way to evaluate a parameter $\theta_I$ for the bottom layer without training
the whole architecture is to be optimistic: assume that the top layers will be
able to produce the probability distribution for $\mathbf{h}$ that gives the best results
if used together with $P_{\theta_I}(\mathbf{x}|\mathbf{h})$. This leads to the following.

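Definition 2. Let $\theta_I$ be a value of the bottom layer parameters. The best
latent marginal (BLM) for $\theta_I$ is the probability distribution $Q$ on $\mathbf{h}$ maximizing
the log-likelihood:

$$\hat Q_{\theta_I,\mathcal{D}} := \operatorname*{arg\,max}_{Q}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, Q(\mathbf{h})\right] \tag{13}$$

where the arg max runs over the set of all probability distributions over $\mathbf{h}$.
The BLM upper bound is the corresponding log-likelihood value:

$$U_{\mathcal{D}}(\theta_I) := \max_{Q}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, Q(\mathbf{h})\right] \tag{14}$$
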

^1 Attempts to prevent auto-encoders from learning the identity (which is completely
justifiable) often result in an even more constrained inference model, e.g., tied weights, or 
sparsity constraints on the hidden representation. 
The BLM upper bound $U_{\mathcal{D}}(\theta_I)$ is the least upper bound on the
log-likelihood of the deep generative model on the dataset $\mathcal{D}$ if $\theta_I$ is used for the
bottom layer. $U_{\mathcal{D}}(\theta_I)$ is only an upper bound of the actual performance of $\theta_I$,
because subsequent training of $P_{\theta_J}(\mathbf{h})$ may be suboptimal: the best latent
marginal $\hat Q_{\theta_I,\mathcal{D}}(\mathbf{h})$ may not be representable as $P_{\theta_J}(\mathbf{h})$ for $\theta_J$ in the model,
or the training of $P_{\theta_J}(\mathbf{h})$ itself may not converge to the best solution.

Note that the function maximized in (13) is concave in $Q$, so that in typical
situations the BLM is unique (except in degenerate cases such as when two values
of $\mathbf{h}$ define the same $P_{\theta_I}(\mathbf{x}|\mathbf{h})$).
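When $\mathbf{x}$ and $\mathbf{h}$ take finitely many values, the maximization in (13) is the classical problem of fitting the mixing weights of a mixture with fixed components, for which the EM update $Q(\mathbf{h}) \leftarrow \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}[Q(\mathbf{h}|\mathbf{x})]$ never decreases the log-likelihood. The sketch below uses this update to estimate the BLM and the BLM upper bound on a toy example. The function name `blm_upper_bound` and the tables are hypothetical, and EM is a choice made for this illustration rather than the estimation procedure of the article.

```python
import numpy as np

def blm_upper_bound(p_data, p_x_given_h, n_iter=500):
    """Estimate the BLM Q(h) and the value (14) by EM on the mixing weights.

    p_data:      shape (n_x,), data distribution P_D(x)
    p_x_given_h: shape (n_h, n_x), fixed bottom layer P(x|h), entries > 0
    """
    n_h = p_x_given_h.shape[0]
    q = np.full(n_h, 1.0 / n_h)                  # start from the uniform Q
    for _ in range(n_iter):
        joint = q[:, None] * p_x_given_h         # Q(h) P(x|h)
        posterior = joint / joint.sum(axis=0, keepdims=True)   # Q(h|x)
        q = posterior @ p_data                   # E_{x ~ P_D} Q(h|x)
    value = float(p_data @ np.log(q @ p_x_given_h))
    return q, value

p_data = np.array([0.4, 0.3, 0.2, 0.1])          # hypothetical P_D
p_x_given_h = np.array([                         # hypothetical P(x|h)
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])
q_blm, u_d = blm_upper_bound(p_data, p_x_given_h)
uniform_value = float(p_data @ np.log(np.full(2, 0.5) @ p_x_given_h))
print(q_blm, u_d, uniform_value)
```

Two bottom layers can then be compared through their values of `u_d` before any upper layer is trained, which is the use of the BLM upper bound suggested in Section 1.2.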

Proposition 3. The criterion (10) used in Theorem 1 for training the bottom
layer coincides with the BLM upper bound:

$$U_{\mathcal{D}}(\theta_I) = \max_{q}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, q_{\mathcal{D}}(\mathbf{h})\right] \tag{15}$$

where the maximum runs over all conditional probability distributions $q(\mathbf{h}|\mathbf{x})$.
In particular the BOLL $\hat\theta_I$ selected in Theorem 1 is

$$\hat\theta_I = \operatorname*{arg\,max}_{\theta_I}\; U_{\mathcal{D}}(\theta_I) \tag{16}$$

and the target distribution $\hat q_{\mathcal{D}}(\mathbf{h})$ in Theorem 1 is the best latent marginal
$\hat Q_{\hat\theta_I,\mathcal{D}}$.

Thus the BOLL $\hat\theta_I$ is the best bottom layer setting if one uses an optimistic
criterion for assessing the bottom layer, hence the name "best optimistic lower
layer".

Proof. Any distribution $Q$ over $\mathbf{h}$ can be written as $q_{\mathcal{D}}$ for some conditional
distribution $q(\mathbf{h}|\mathbf{x})$, for instance by defining $q(\mathbf{h}|\mathbf{x}) = Q(\mathbf{h})$ for every $\mathbf{x}$ in the
dataset. In particular this is the case for the best latent marginal $\hat Q_{\theta_I,\mathcal{D}}$.

Consequently the maxima in (15) and in (14) are taken on the same set
and coincide. □

The argument that any distribution is of the form $q_{\mathcal{D}}$ may look disappointing:
why choose this particular form? In Section 2.7 we show how writing
distributions over $\mathbf{h}$ as $q_{\mathcal{D}}$ for some conditional distribution $q(\mathbf{h}|\mathbf{x})$ may help
to maximize data log-likelihood, by quantifiably incorporating information
from the data (Proposition 7). Moreover, the bound on loss of performance
(second part of Theorem 1) when the upper layers do not match the BLM
crucially relies on the properties of $\hat q_{\mathcal{D}}$. A more practical argument for using
$q_{\mathcal{D}}$ is that optimizing both $\theta_I$ and the full distribution of the hidden variable $\mathbf{h}$
at the same time is just as difficult as optimizing the whole network, whereas
the deep architectures currently in use already train a model of $\mathbf{x}$ knowing $\mathbf{h}$
and of $\mathbf{h}$ knowing $\mathbf{x}$ at the same time.


Remark 4. For Theorem 1 to hold, it is not necessary to optimize over all
possible conditional probability distributions $q(\mathbf{h}|\mathbf{x})$ (which is a set of very
large dimension). As can be seen from the proof above it is enough to optimize
over a family $q(\mathbf{h}|\mathbf{x}) \in \mathcal{Q}$ such that every (non-conditional) distribution on $\mathbf{h}$
can be represented (or well approximated) as $q_{\mathcal{D}}(\mathbf{h})$ for some $q \in \mathcal{Q}$.

Let us now go on with the proof of Theorem 1. 

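Proposition 5. Set the bottom layer parameters to the BOLL

$$\hat\theta_I = \operatorname*{arg\,max}_{\theta_I}\; U_{\mathcal{D}}(\theta_I) \tag{17}$$

and let $\hat Q$ be the corresponding best latent marginal.

Assume that subsequent training of the top layers using $\hat Q$ as the target
distribution for $\mathbf{h}$, is successful, i.e., there exists a $\theta_J$ such that
$\hat Q(\mathbf{h}) = P_{\theta_J}(\mathbf{h})$.

Then $\hat\theta_I = \theta^*_I$.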
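
Proof. Define the in-model BLM upper bound as

$$U^{\mathrm{model}}_{\mathcal{D}}(\theta_I) := \max_{\theta_J}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, P_{\theta_J}(\mathbf{h})\right] \tag{18}$$

By definition, the global optimum $\theta^*_I$ for the parameters of the whole
architecture is given by $\theta^*_I = \operatorname*{arg\,max}_{\theta_I} U^{\mathrm{model}}_{\mathcal{D}}(\theta_I)$.

Obviously, for any value $\theta_I$ we have $U^{\mathrm{model}}_{\mathcal{D}}(\theta_I) \le U_{\mathcal{D}}(\theta_I)$ since the maximum
is taken over a more restricted set. Then, in turn, $U_{\mathcal{D}}(\theta_I) \le U_{\mathcal{D}}(\hat\theta_I)$ by
definition of $\hat\theta_I$.

By our assumption, the BLM $\hat Q$ for $\hat\theta_I$ happens to lie in the model:
$\hat Q(\mathbf{h}) = P_{\theta_J}(\mathbf{h})$. This implies that $U_{\mathcal{D}}(\hat\theta_I) = U^{\mathrm{model}}_{\mathcal{D}}(\hat\theta_I)$.

Combining, we get that $U^{\mathrm{model}}_{\mathcal{D}}(\theta_I) \le U^{\mathrm{model}}_{\mathcal{D}}(\hat\theta_I)$ for any $\theta_I$. Thus $\hat\theta_I$
maximizes $U^{\mathrm{model}}_{\mathcal{D}}(\theta_I)$, and is thus equal to $\theta^*_I$. □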
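As a reading aid, the steps of this proof can be collected into a single chain, valid for any $\theta_I$ under the assumption of Proposition 5 that the BLM for the BOLL lies in the model. The notation $U^{\mathrm{model}}_{\mathcal{D}}$ is that of (18), and $\hat\theta_I$ denotes the BOLL.

```latex
U^{\mathrm{model}}_{\mathcal{D}}(\theta_I)
  \;\le\; U_{\mathcal{D}}(\theta_I)                      % maximum over a larger set of marginals
  \;\le\; U_{\mathcal{D}}(\hat\theta_I)                  % definition of the BOLL
  \;=\;   U^{\mathrm{model}}_{\mathcal{D}}(\hat\theta_I) % the BLM is realized by the top layers
```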

The first part of Theorem 1 then results from the combination of Propo- 
sitions 5 and 3. 

We now give a bound on the loss of performance in case further training 
of the upper layers fails to reproduce the BLM. This will complete the 
proof of Theorem 1. We will make use of a special optimality property
of distributions of the form $q_{\mathcal{D}}(\mathbf{h})$, namely Proposition 7, whose proof is
postponed to Section 2.7. 

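Proposition 6. Keep the notation of Theorem 1. In the case when $P_{\theta_J}(\mathbf{h})$
fails to reproduce $\hat q_{\mathcal{D}}(\mathbf{h})$ exactly, the loss of performance of $(\hat\theta_I, \theta_J)$ with respect
to the global optimum $(\theta^*_I, \theta^*_J)$ is at most

$$D_{\mathrm{KL}}(P_{\mathcal{D}}(\mathbf{x}) \,\|\, P_{\hat\theta_I,\theta_J}(\mathbf{x})) - D_{\mathrm{KL}}(P_{\mathcal{D}}(\mathbf{x}) \,\|\, \hat q_{\mathcal{D},\hat\theta_I}(\mathbf{x})) \tag{19}$$

where $\hat q_{\mathcal{D},\hat\theta_I}(\mathbf{x}) := \sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, \hat q_{\mathcal{D}}(\mathbf{h})$ is the distribution on $\mathbf{x}$ obtained by
using the BLM.

This quantity is in turn at most

$$D_{\mathrm{KL}}(\hat q_{\mathcal{D}}(\mathbf{h}) \,\|\, P_{\theta_J}(\mathbf{h})) \tag{20}$$

which is thus also a bound on the loss of performance of $(\hat\theta_I, \theta_J)$ with respect
to $(\theta^*_I, \theta^*_J)$.

Note that these estimates do not depend on the unknown global optimum
$\theta^*$.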

Importantly, this bound is not valid if q has not been perfectly optimized 
over all possible conditional distributions $q(\mathbf{h}|\mathbf{x})$. Thus it should not be used
blindly to get a performance bound, since heuristics will always be used to 
find q. Therefore, it may have only limited practical relevance. In practice the 
real loss may both be larger than this bound because q has been optimized 
over a smaller set, and smaller because we are comparing to the BLM upper 
bound which is an optimistic assessment. 

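Proof. From (4) and (5), the difference in log-likelihood performance between
any two distributions $p_1(\mathbf{x})$ and $p_2(\mathbf{x})$ is equal to
$D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_1) - D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_2)$.
For simplicity, denote

$$p_1(\mathbf{x}) = P_{\hat\theta_I,\theta_J}(\mathbf{x}) = \sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, P_{\theta_J}(\mathbf{h})$$

$$p_2(\mathbf{x}) = P_{\theta^*_I,\theta^*_J}(\mathbf{x}) = \sum_{\mathbf{h}} P_{\theta^*_I}(\mathbf{x}|\mathbf{h})\, P_{\theta^*_J}(\mathbf{h})$$

$$p_3(\mathbf{x}) = \sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, \hat q_{\mathcal{D}}(\mathbf{h})$$

We want to compare $p_1$ and $p_2$.

Define the in-model upper bound $U^{\mathrm{model}}_{\mathcal{D}}(\theta_I)$ as in (18) above. Then we
have $\theta^*_I = \operatorname*{arg\,max}_{\theta_I} U^{\mathrm{model}}_{\mathcal{D}}(\theta_I)$ and $\hat\theta_I = \operatorname*{arg\,max}_{\theta_I} U_{\mathcal{D}}(\theta_I)$. Since
$U^{\mathrm{model}}_{\mathcal{D}} \le U_{\mathcal{D}}$, we have $U^{\mathrm{model}}_{\mathcal{D}}(\theta^*_I) \le U_{\mathcal{D}}(\hat\theta_I)$. The BLM upper bound $U_{\mathcal{D}}(\hat\theta_I)$ is attained
when we use $\hat q_{\mathcal{D}}$ as the distribution for $\mathbf{h}$, so $U^{\mathrm{model}}_{\mathcal{D}}(\theta^*_I) \le U_{\mathcal{D}}(\hat\theta_I)$ means that
the performance of $p_3$ is better than the performance of $p_2$:

$$D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_3) \le D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_2)$$

(inequalities hold in the reverse order for data log-likelihood).

Now by definition of the optimum $\theta^*$, the distribution $p_2$ is better than $p_1$:
$D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_2) \le D_{\mathrm{KL}}(P_{\mathcal{D}} \,\|\, p_1)$. Consequently, the difference in performance
between $p_2$ and $p_1$ (whether expressed in data log-likelihood or in
Kullback-Leibler divergence) is smaller than the difference in performance between $p_3$
and $p_1$, which is the difference of Kullback-Leibler divergences appearing in
the proposition.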
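Let us now evaluate more precisely the loss of $p_1$ with respect to $p_3$.
By abuse of notation we will indifferently denote $p_1(\mathbf{h})$ and $p_1(\mathbf{x})$, it being
understood that one is obtained from the other through $P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})$, and likewise
for $p_3$ (with the same $\hat\theta_I$).

For any distributions $p_1$ and $p_3$ the loss of performance of $p_1$ w.r.t. $p_3$
satisfies

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}} \log p_3(\mathbf{x}) - \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}} \log p_1(\mathbf{x}) = \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\log \frac{\sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_3(\mathbf{h})}{\sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_1(\mathbf{h})}\right]$$

and by the log sum inequality
$\log\left(\sum_i a_i / \sum_i b_i\right) \le \frac{1}{\sum_i a_i} \sum_i a_i \log(a_i / b_i)$ [9,
Theorem 2.7.1] we get

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}} \log p_3(\mathbf{x}) - \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}} \log p_1(\mathbf{x})
\le \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\frac{1}{\sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_3(\mathbf{h})} \sum_{\mathbf{h}} P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_3(\mathbf{h}) \log \frac{P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_3(\mathbf{h})}{P_{\hat\theta_I}(\mathbf{x}|\mathbf{h})\, p_1(\mathbf{h})}\right]$$

$$= \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\left[\frac{1}{p_3(\mathbf{x})} \sum_{\mathbf{h}} p_3(\mathbf{x}, \mathbf{h}) \log \frac{p_3(\mathbf{h})}{p_1(\mathbf{h})}\right]$$

$$= \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\, \mathbb{E}_{\mathbf{h}\sim p_3(\mathbf{h}|\mathbf{x})}\left[\log \frac{p_3(\mathbf{h})}{p_1(\mathbf{h})}\right]$$

Given a probability $p_3$ on $(\mathbf{x}, \mathbf{h})$, the law on $\mathbf{h}$ obtained by taking an $\mathbf{x}$
according to $P_{\mathcal{D}}$, then taking an $\mathbf{h}$ according to $p_3(\mathbf{h}|\mathbf{x})$, is generally not equal
to $p_3(\mathbf{h})$. However, here $p_3$ is equal to the BLM $\hat q_{\mathcal{D}}$, and by Proposition 7 below
the BLM has exactly this property (which characterizes the log-likelihood
extrema). Thus thanks to Proposition 7 we have

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\, \mathbb{E}_{\mathbf{h}\sim \hat q_{\mathcal{D}}(\mathbf{h}|\mathbf{x})}\left[\log \frac{\hat q_{\mathcal{D}}(\mathbf{h})}{p_1(\mathbf{h})}\right]
= \mathbb{E}_{\mathbf{h}\sim \hat q_{\mathcal{D}}}\left[\log \frac{\hat q_{\mathcal{D}}(\mathbf{h})}{p_1(\mathbf{h})}\right]
= D_{\mathrm{KL}}(\hat q_{\mathcal{D}}(\mathbf{h}) \,\|\, p_1(\mathbf{h}))$$

which concludes the argument. □
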
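The last steps of this proof can be followed numerically on a toy example. In the sketch below, all tables are hypothetical, and `p3` is only an EM approximation of the BLM for a fixed bottom layer. The quantity `log_sum_bound` is the right-hand side produced by the log sum inequality, which holds for any pair of marginals on $\mathbf{h}$. The quantity `kl` is the Kullback-Leibler divergence that it equals once `p3` is exactly a fixed point of the update $Q(\mathbf{h}) \leftarrow \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}[Q(\mathbf{h}|\mathbf{x})]$.

```python
import numpy as np

def em_step(q, p_data, p_x_given_h):
    """Law of h when x ~ P_D and then h ~ q(h|x), with q(h|x) built from q(h)."""
    joint = q[:, None] * p_x_given_h
    posterior = joint / joint.sum(axis=0, keepdims=True)
    return posterior @ p_data

def log_lik(q, p_data, p_x_given_h):
    return float(p_data @ np.log(q @ p_x_given_h))

def weighted_log_ratio(w, a, b):
    """sum_h w(h) log(a(h)/b(h)), with the convention 0 log 0 = 0."""
    mask = (w > 0) & (a > 0)
    return float(np.sum(w[mask] * np.log(a[mask] / b[mask])))

rng = np.random.default_rng(1)
p_data = rng.dirichlet(np.ones(5))                # hypothetical P_D(x)
p_x_given_h = rng.dirichlet(np.ones(5), size=3)   # hypothetical P(x|h), rows = h

p3 = np.full(3, 1.0 / 3.0)
for _ in range(200):                              # approximate the BLM by EM
    p3 = em_step(p3, p_data, p_x_given_h)
p1 = rng.dirichlet(np.ones(3))                    # an imperfect top layer

gap = log_lik(p3, p_data, p_x_given_h) - log_lik(p1, p_data, p_x_given_h)
pushed = em_step(p3, p_data, p_x_given_h)         # equals p3 at the BLM
log_sum_bound = weighted_log_ratio(pushed, p3, p1)
kl = weighted_log_ratio(p3, p3, p1)
print(gap, log_sum_bound, kl)
```

When EM has not converged, `pushed` differs from `p3` and the two printed bounds differ accordingly, which is a small-scale view of the caveat that the bound relies on perfect optimization of $q$.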



2.3 Relation with Stacked RBMs 

Stacked RBMs (SRBMs) [12, 1, 14] are deep generative models trained by 
stacking restricted Boltzmann machines (RBMs) [2-]]. 

An RBM uses a single set of parameters to represent a distribution on
pairs $(\mathbf{x}, \mathbf{h})$. Similarly to our approach, stacked RBMs are trained in a greedy
layer-wise fashion: one starts by training the distribution of the bottom RBM
to approximate the distribution of $\mathbf{x}$. To do so, distributions $P_{\theta_I}(\mathbf{x}|\mathbf{h})$ and
$Q_{\theta_I}(\mathbf{h}|\mathbf{x})$ are learned jointly using a single set of parameters $\theta_I$. Then a
target distribution for $\mathbf{h}$ is defined as $\sum_{\mathbf{x}} Q_{\theta_I}(\mathbf{h}|\mathbf{x})\, P_{\mathcal{D}}(\mathbf{x})$ (similarly to (11))
and the top layers are trained recursively on this distribution.

In the final generative model, the full top RBM is used on the top layer 
to provide a distribution for h, then the bottom RBMs are used only for the 
generation of x knowing h. (Therefore the h-biases of the bottom RBMs are 
never used in the final generative model.) 

Thus, in contrast with our approach, $P_{\theta_I}(\mathbf{x}|\mathbf{h})$ and $Q_{\theta_I}(\mathbf{h}|\mathbf{x})$ are not
trained to maximize the least upper bound of the likelihood of the full deep 
generative model but are trained to maximize the likelihood of a single RBM. 

This procedure has been shown to be equivalent to maximizing the 
likelihood of a deep generative model with infinitely many layers where the 
weights are all tied [ ]. The latter can be interpreted as an assumption on
the future value of $P(\mathbf{h})$, which is unknown when learning the first layer. As
such, SRBMs make a different assumption about the future $P(\mathbf{h})$ than the
one made in (10). 

With respect to this, the comparison of gradient ascents is instructive: 
the gradient ascent for training the bottom RBM takes a form reminiscent of 
gradient ascent of the global generative model (7) but in which the dependency 
of $P(\mathbf{h})$ on the upper layers $\theta_J$ is ignored, and instead the distribution $P(\mathbf{h})$
is tied to $\theta_I$ because the RBM uses a single parameter set for both.

When adding a new layer on top of a trained RBM, if the initialization 
is set to an upside down version of the current RBM (which can be seen as 
"unrolling" one step of Gibbs sampling), the new deep model still matches 
the special infinite deep generative model with tied weights mentioned above. 
Starting training of the upper layer from this initialization guarantees that 
the new layer can only increase the likelihood [12]. However, this result is 
only known to hold for two layers; with more layers, it is only known that 
adding layers increases a bound on the likelihood [12]. 

In our approach, the perspective is different. During the training of lower 
layers, we consider the best possible model for the hidden variable. Because 
of errors which are bound to occur in approximation and optimization during 
the training of the model for P(h), the likelihood associated with an optimal 
upper model (the BLM upper bound) is expected to decrease each time we 
actually take another lower layer into account: At each new layer, errors in 
approximation or optimization occur so that the final likelihood of the training 
set will be smaller than the upper bound. (On the other hand, these limitations
might actually improve performance on a test set, see the discussion about 
regularization in Section 3.) 

In [ ] a training criterion is suggested for SRBMs which is reminiscent of 
a BLM with tied weights for the inference and generative parts (and therefore 
without the BLM optimality guarantee), see also Section 2.5. 
2.4 Relation with Auto-Encoders

Since the introduction of deep neural networks, auto-encoders [ ] have been 
considered a credible alternative to stacked RBMs and have been shown to 
have almost identical performance on several tasks [ ]. 

Auto-encoders are trained by stacking auto-associators [ ] trained with 
backpropagation. Namely: we start with a three-layer network
$\mathbf{x} \mapsto \mathbf{h}^{(1)} \mapsto \mathbf{x}$ trained by backpropagation to reproduce the data; this provides two
conditional distributions $P(\mathbf{h}^{(1)}|\mathbf{x})$ and $P(\mathbf{x}|\mathbf{h}^{(1)})$. Then in turn, another
auto-associator is trained as a three-layer network
$\mathbf{h}^{(1)} \mapsto \mathbf{h}^{(2)} \mapsto \mathbf{h}^{(1)}$, to
reproduce the distribution $P(\mathbf{h}^{(1)}|\mathbf{x})$ on $\mathbf{h}^{(1)}$, etc.

So as in the learning of SRBMs, auto-encoder training is performed in a 
greedy layer-wise manner, but with a different criterion: the reconstruction 
error. 

Note that after the auto-encoder has been trained, the deep generative 
model is incomplete because it lacks a generative model for the distribution 
$P(\mathbf{h}^{(k_{\max})})$ of the deepest hidden variable, which the auto-encoder does not
provide^2. One possibility is to learn the top layer with an RBM, which then
completes the generative model. 

Concerning the theoretical soundness of stacking auto-associators for train- 
ing deep generative models, it is known that the training of auto-associators 
is an approximation of the training of RBMs in which only the largest term 
of an expansion of the log-likelihood is kept [ ]. In this sense, SRBM and
stacked auto-associator training approximate each other (see also Section 2.5). 

Our approach gives a new understanding of auto-encoders as the lower 
part of a deep generative model, because they are trained to maximize a 
lower bound of (10), as follows. 

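To fix ideas, let us consider for (10) a particular class of conditional
distributions $q(\mathbf{h}|\mathbf{x})$ commonly used in auto-associators. Namely, let us
parametrize $q$ as $q_\xi$ with

$$q_\xi(\mathbf{h}|\mathbf{x}) = \prod_j q_\xi(h_j|\mathbf{x}) \tag{21}$$

$$q_\xi(h_j|\mathbf{x}) = \operatorname{sigm}\Big(\sum_i x_i w_{ij} + b_j\Big) \tag{22}$$

where the parameter vector is $\xi = \{W, b\}$ and $\operatorname{sigm}(\cdot)$ is the sigmoid function.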
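A minimal sketch of the inference model (21)-(22) for binary hidden units follows. It assumes that (22) gives the probability that unit $h_j$ is on, the probability of it being off being the complement. The function name `q_xi` and the parameter values are hypothetical.

```python
import itertools
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def q_xi(h, x, W, b):
    """Probability of the binary vector h given x under (21)-(22)."""
    on = sigm(x @ W + b)                       # q(h_j = 1 | x) for each unit j
    return float(np.prod(np.where(h == 1, on, 1.0 - on)))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                    # 3 visible units, 2 hidden units
b = rng.normal(size=2)
x = np.array([1.0, 0.0, 1.0])

# The factorized form (21) defines a proper distribution over the 2^2 states of h.
total = sum(q_xi(np.array(h), x, W, b) for h in itertools.product([0, 1], repeat=2))
print(total)
```

Because the units are independent given $\mathbf{x}$, this family cannot represent correlations between hidden units for a fixed input, which is one sense in which it is a restricted inference model.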

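Given a conditional distribution $q(\mathbf{h}|\mathbf{x})$ as in Theorem 1, let us expand
the distribution on $\mathbf{x}$ obtained from $P_{\theta_I}(\mathbf{x}|\mathbf{h})$ and $q_{\mathcal{D}}(\mathbf{h})$:

$$P(\mathbf{x}) = \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, q_{\mathcal{D}}(\mathbf{h}) \tag{23}$$

$$\phantom{P(\mathbf{x})} = \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h}) \sum_{\tilde{\mathbf{x}}} q(\mathbf{h}|\tilde{\mathbf{x}})\, P_{\mathcal{D}}(\tilde{\mathbf{x}}) \tag{24}$$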

^2 Auto-associators can in fact be used as valid generative models from which sampling is
possible [ ] in the setting of manifold learning but this is beyond the scope of this article.
where as usual $P_{\mathcal{D}}$ is the data distribution. Keeping only the terms $\tilde{\mathbf{x}} = \mathbf{x}$ in
this expression we see that

$$P(\mathbf{x}) \ge \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, q(\mathbf{h}|\mathbf{x})\, P_{\mathcal{D}}(\mathbf{x}) \tag{25}$$

Taking the sum of likelihoods over $\mathbf{x}$ in the dataset, this corresponds to the
criterion maximized by auto-associators when they are considered from a
probabilistic perspective^3. Since moreover optimizing over $q$ as in (10) is more
general than optimizing over the particular class $q_\xi$, we conclude that the
criterion optimized in auto-associators is a lower bound on the criterion (10)
proposed in Theorem 1.
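The inequality can be illustrated on a small discrete example in which every quantity is a table. The sketch below compares the criterion built from the full marginal on $\mathbf{h}$ with the one that keeps only the terms $\tilde{\mathbf{x}} = \mathbf{x}$; the tables are random and hypothetical, and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 5, 3
p_data = rng.dirichlet(np.ones(n_x))                 # P_D(x)
q = rng.dirichlet(np.ones(n_h), size=n_x)            # q[x, h] = q(h|x)
p_x_given_h = rng.dirichlet(np.ones(n_x), size=n_h)  # p_x_given_h[h, x] = P(x|h)

# Marginal on h induced by the data and the inference model.
q_d = p_data @ q
# Full model probability of each x, summing over all data points x~.
full = q_d @ p_x_given_h
# Same sum restricted to the single term x~ = x.
diag = p_data * np.einsum("xh,hx->x", q, p_x_given_h)

blm_criterion = float(p_data @ np.log(full))         # criterion with the full marginal
ae_criterion = float(p_data @ np.log(diag))          # auto-associator-style criterion
print(ae_criterion, blm_criterion)
```

The gap between the two values comes from the dropped cross terms, so it shrinks when each $\mathbf{h}$ is essentially associated with a single data point.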

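Keeping only $\tilde{\mathbf{x}} = \mathbf{x}$ is justified if we assume that inference is an
approximation of the inverse of the generative process^4, that is,
$P_{\theta_I}(\mathbf{x}|\mathbf{h})\, q(\mathbf{h}|\tilde{\mathbf{x}}) \approx 0$
as soon as $\mathbf{x} \neq \tilde{\mathbf{x}}$. Thus under this assumption, both criteria will be close,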
so that Theorem 1 provides a justification for auto-encoder training in this 
case. On the other hand, this assumption can be strong: it implies that no 
h can be shared between different x, so that for instance two observations 
cannot come from the same underlying latent variable through a random 
choice. Depending on the situation this might be unrealistic. Still, using this 
as a training criterion might perform well even if the assumption is not fully 
satisfied. 

Note that we chose the form of $q_\xi(\mathbf{h}|\mathbf{x})$ to match that of the usual
auto-associator, but of course we could have made a different choice such as using
a multilayer network for $q_\xi(\mathbf{h}|\mathbf{x})$ or $P_{\theta_I}(\mathbf{x}|\mathbf{h})$. These possibilities will be
explored later in this article. 

2.5 From stacked RBMs to auto-encoders: layer-wise consistency

We now show how imposing a "layer-wise consistency" constraint on stacked 
RBM training leads to the training criterion used in auto-encoders with tied 
weights. Some of the material here already appears in [ ].

Let us call layer-wise consistent a layer-wise training procedure in which
each layer determines a value $\theta_I$ for its parameters and a target distribution
$P(\mathbf{h})$ for the upper layers which are mutually optimal in the following sense:
if $P(\mathbf{h})$ is used as the distribution of the hidden variable, then $\theta_I$ is the bottom
parameter value maximizing data log-likelihood.

^3 In all fairness, the training of auto-associators by backpropagation, in probabilistic
terms, consists in the maximization of $P(\mathbf{y}|\mathbf{x})P_{\mathcal{D}}(\mathbf{x}) = o(\mathbf{x})P_{\mathcal{D}}(\mathbf{x})$ with $\mathbf{y} = \mathbf{x}$ [ ], where $o$
is the output function of the neural network. In this perspective, the hidden variable $\mathbf{h}$ is
not considered as a random variable but as an intermediate value in the form of $P(\mathbf{y}|\mathbf{x})$.
Here, we introduce $\mathbf{h}$ as an intermediate random variable as in [ ]. The criterion we
wish to maximize is then $P(\mathbf{y}|\mathbf{x})P_{\mathcal{D}}(\mathbf{x}) = \sum_{\mathbf{h}} f(\mathbf{y}|\mathbf{h})\, q(\mathbf{h}|\mathbf{x})\, P_{\mathcal{D}}(\mathbf{x})$, with $\mathbf{y} = \mathbf{x}$. Training
with backpropagation can be done by sampling $\mathbf{h}$ from $q(\mathbf{h}|\mathbf{x})$ instead of using the raw
activation value of $q(\mathbf{h}|\mathbf{x})$, but in practice we do not sample $\mathbf{h}$ as it does not significantly
affect performance.

^4 Which is a reasonable assumption if we are to perform inference in any meaningful
sense of the word.

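The BLM training procedure is, by construction, layer-wise consistent.

Let us try to train stacked RBMs in a layer-wise consistent way. Given a
parameter $\theta_I$, SRBMs use the hidden variable distribution

$$q_{\mathcal{D},\theta_I}(\mathbf{h}) = \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}}\, P_{\theta_I}(\mathbf{h}|\mathbf{x}) \qquad (26)$$

as the target for the next layer, where $P_{\theta_I}(\mathbf{h}|\mathbf{x})$ is the RBM distribution of $\mathbf{h}$
knowing $\mathbf{x}$. The value $\theta_I$ and this distribution over $\mathbf{h}$ are mutually optimal
for each other if the distribution on $\mathbf{x}$ stemming from this distribution on $\mathbf{h}$,
given by

$$P^{(1)}_{\theta_I}(\mathbf{x}) = \mathbb{E}_{\mathbf{h}\sim q_{\mathcal{D},\theta_I}}\, P_{\theta_I}(\mathbf{x}|\mathbf{h}) \qquad (27)$$
$$\phantom{P^{(1)}_{\theta_I}(\mathbf{x})} = \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h}) \sum_{\tilde{\mathbf{x}}} P_{\theta_I}(\mathbf{h}|\tilde{\mathbf{x}})\, P_{\mathcal{D}}(\tilde{\mathbf{x}}) \qquad (28)$$

maximizes log-likelihood, i.e.,

$$\theta_I = \arg\min_{\theta_I} D_{\mathrm{KL}}\big(P_{\mathcal{D}}(\mathbf{x}) \,\|\, P^{(1)}_{\theta_I}(\mathbf{x})\big) \qquad (29)$$

The distribution $P^{(1)}_{\theta_I}(\mathbf{x})$ is the one obtained from the data after one
"forward-backward" step of Gibbs sampling $\mathbf{x} \to \mathbf{h} \to \mathbf{x}$ (cf. [16]).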

But $P^{(1)}_{\theta_I}(\mathbf{x})$ is also equal to the distribution (24) for an auto-encoder with
tied weights. So the layer-wise consistency criterion for RBMs coincides with
tied-weights auto-encoder training, up to the approximation that in practice
auto-encoders retain only the terms $\mathbf{x} = \tilde{\mathbf{x}}$ in the above (Section 2.4).

On the other hand, stacked RBM training trains the parameter $\theta_I$ to
approximate the data distribution by the RBM distribution:

$$\theta_I^{\mathrm{RBM}} = \arg\min_{\theta_I} D_{\mathrm{KL}}\big(P_{\mathcal{D}}(\mathbf{x}) \,\|\, P^{\mathrm{RBM}}_{\theta_I}(\mathbf{x})\big) \qquad (30)$$

where $P^{\mathrm{RBM}}_{\theta_I}$ is the probability distribution of the RBM with parameter $\theta_I$,
i.e. the probability distribution after an infinite number of Gibbs samplings
from the data.

Thus, stacked RBM training and tied-weight auto-encoder training can
be seen as two approximations to the layer-wise consistent optimization
problem (29), one using the full RBM distribution $P^{\mathrm{RBM}}_{\theta_I}$ instead of $P^{(1)}_{\theta_I}$, and
the other using $\mathbf{x} = \tilde{\mathbf{x}}$ in $P^{(1)}_{\theta_I}$.

It is not clear to us to which extent the criteria (29) using $P^{(1)}_{\theta_I}$ and (30)
using $P^{\mathrm{RBM}}_{\theta_I}$ actually yield different values for the optimal $\theta_I$: although these
two optimization criteria are different (unless RBM Gibbs sampling converges
in one step), it might be that the optimal $\theta_I$ is the same (in which case SRBM
training would be layer-wise consistent), though this seems unlikely.

The $\theta_I$ obtained from the layer-wise consistent criterion (29) using $P^{(1)}_{\theta_I}(\mathbf{x})$
will always perform at least as well as standard SRBM training if the upper
layers match the target distribution on $\mathbf{h}$ perfectly; this follows from its very
definition.

Nonetheless, it is not clear whether layer-wise consistency is always a
desirable property. In SRBM training, replacing the RBM distribution over $\mathbf{h}$
with the one obtained from the data seemingly breaks layer-wise consistency,
but at the same time it always improves data log-likelihood (as a consequence
of Proposition 7 below).

For non-layer-wise consistent training procedures, fine-tuning of $\theta_I$ after
more layers have been trained would improve performance. Layer-wise
consistent procedures may require this as well in case the upper layers do not
match the target distribution on $\mathbf{h}$ (while non-layer-wise consistent procedures
would require this even with perfect upper layer training).
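
To make the objects in (26)-(30) concrete, here is a minimal sketch that enumerates all states of a very small binary RBM and computes the hidden target distribution (26), the one-step distribution $P^{(1)}_{\theta_I}$ of (27)-(28), and the stationary RBM distribution used in (30). The weights, biases and toy data distribution are arbitrary assumptions chosen for illustration; exhaustive enumeration is only feasible for a handful of units.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def bernoulli_prod(values, probs):
    """Probability of a binary vector under independent Bernoulli units."""
    return math.prod(p if v else 1.0 - p for v, p in zip(values, probs))


def rbm_distributions(W, b, c, p_data):
    """Return (q_D, P^(1), P^RBM) for an RBM with weights W[i][j],
    visible biases b and hidden biases c, by exhaustive enumeration."""
    n_v, n_h = len(b), len(c)
    xs = list(itertools.product([0, 1], repeat=n_v))
    hs = list(itertools.product([0, 1], repeat=n_h))

    def p_h_given_x(h, x):
        probs = [sigmoid(c[j] + sum(W[i][j] * x[i] for i in range(n_v)))
                 for j in range(n_h)]
        return bernoulli_prod(h, probs)

    def p_x_given_h(x, h):
        probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(n_h)))
                 for i in range(n_v)]
        return bernoulli_prod(x, probs)

    # (26): hidden distribution obtained by pushing the data through P(h|x).
    q = {h: sum(p_data[x] * p_h_given_x(h, x) for x in xs) for h in hs}
    # (27)-(28): distribution on x after one forward-backward Gibbs step.
    p1 = {x: sum(q[h] * p_x_given_h(x, h) for h in hs) for x in xs}

    # Stationary RBM distribution, from the unnormalized joint exp(-energy).
    def unnormalized(x):
        return sum(math.exp(sum(b[i] * x[i] for i in range(n_v))
                            + sum(c[j] * h[j] for j in range(n_h))
                            + sum(W[i][j] * x[i] * h[j]
                                  for i in range(n_v) for j in range(n_h)))
                   for h in hs)

    z = sum(unnormalized(x) for x in xs)
    p_rbm = {x: unnormalized(x) / z for x in xs}
    return q, p1, p_rbm


def kl(p, r):
    """Kullback-Leibler divergence KL(p || r) between two dicts."""
    return sum(pv * math.log(pv / r[k]) for k, pv in p.items() if pv > 0)


W = [[1.0, -0.5], [0.5, 2.0]]
b = [0.0, -1.0]
c = [0.5, 0.0]
p_data = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
q, p1, p_rbm = rbm_distributions(W, b, c, p_data)
kl_one_step = kl(p_data, p1)   # quantity minimized in (29)
kl_rbm = kl(p_data, p_rbm)     # quantity minimized in (30)
```

The two divergences are in general different for a given parameter value, which is the point made in the text: criteria (29) and (30) are distinct objectives.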

2.6 Relation to fine-tuning 

When the approach presented in Section 2 is used recursively to train deep 
generative models with several layers using the criterion (10), irrecoverable 
losses may be incurred at each step: first, because the optimization problem 
(10) may be imperfectly solved, and, second, because each layer was trained 
using a BLM assumption about what upper layers are able to do, and 
subsequent upper layer training may not match the BLM. Consequently the 
parameters used for each layer may not be optimal with respect to each other. 
This suggests using a fine-tuning procedure. 

In the case of auto-encoders, fine-tuning can be done by backpropagation 
on all (inference and generative) layers at once (Figure 1). This has been 
shown to improve performance^ in several contexts [14, 13], which confirms 
the expected gain in performance from recovering earlier approximation losses. 
In principle, there is no limit to the number of layers of an auto-encoder that 
could be trained at once by backpropagation, but in practice training many 
layers at once results in a difficult optimization problem with many local 
minima. Layer-wise training can be seen as a way of dealing with the issue of 
local minima, providing a solution close to a good optimum. This optimum 
is then reached by global fine-tuning. 

Fine-tuning can be described in the BLM framework as follows: fine-tuning
is the maximization of the BLM upper bound (10) where all the layers
are considered as one single complex layer (Figure 1). In the case of
auto-encoders, the approximation $\mathbf{x} = \tilde{\mathbf{x}}$ in (10)-(11) is used to help optimization,
as explained above.

^The exact likelihood not being tractable for larger models, it is necessary to rely on a 
proxy such as classification performance to evaluate the performance of the deep network. 
Note that there is no reason to limit fine-tuning to the end of the layer-wise
procedure: fine-tuning may be used at intermediate stages where any
number of layers have been trained.

This fine-tuning procedure was not applied in the experiments below 
because our experiments only have one layer for the bottom part of the 
model. 

As mentioned before, a generative model for the topmost hidden layer 
(e.g., an RBM) still needs to be trained to get a complete generative model 
after fine-tuning. 



2.7 Data Incorporation: Properties of $q_{\mathcal{D}}$

It is not clear why it should be more interesting to work with the conditional
distribution $q(\mathbf{h}|\mathbf{x})$ and then define a distribution on $\mathbf{h}$ through $q_{\mathcal{D}}$, rather
than working directly with a distribution $Q$ on $\mathbf{h}$.

The first answer is practical: optimizing on $P_{\theta_I}(\mathbf{x}|\mathbf{h})$ and on the distribution
of $\mathbf{h}$ simultaneously is just the same as optimizing over the global network,
while on the other hand the currently used deep architectures provide both
$\mathbf{x}|\mathbf{h}$ and $\mathbf{h}|\mathbf{x}$ at the same time.

A second answer is mathematical: $q_{\mathcal{D}}$ is defined through the dataset
$\mathcal{D}$. Thus by working on $q(\mathbf{h}|\mathbf{x})$ we can concentrate on the correspondence
between $\mathbf{h}$ and $\mathbf{x}$ and not on the full distribution of either, and hopefully this
correspondence is easier to describe. Then we use the dataset $\mathcal{D}$ to provide
$q_{\mathcal{D}}$: so rather than directly crafting a distribution $Q(\mathbf{h})$, we use a distribution
which automatically incorporates aspects of the data distribution $P_{\mathcal{D}}$ even for
very simple $q$. Hopefully this is better; we now formalize this argument.

Let us fix the bottom layer parameters $\theta_I$ and consider the problem
of finding the best latent marginal over $\mathbf{h}$, i.e., the $Q$ maximizing the data
log-likelihood

$$\arg\max_{Q}\; \mathbb{E}_{\mathbf{x}\sim P_{\mathcal{D}}} \Big[ \log \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}|\mathbf{h})\, Q(\mathbf{h}) \Big] \qquad (31)$$

Let $Q(\mathbf{h})$ be a candidate distribution. We might build a better one by
"reflecting the data" in it. Namely, $Q(\mathbf{h})$ defines a distribution $P_{\theta_I}(\mathbf{x}|\mathbf{h})\,Q(\mathbf{h})$
on $(\mathbf{x}, \mathbf{h})$. This distribution, in turn, defines a conditional distribution of $\mathbf{h}$
knowing $\mathbf{x}$ in the standard way:

$$Q^{\mathrm{cond}}(\mathbf{h}|\mathbf{x}) := \frac{P_{\theta_I}(\mathbf{x}|\mathbf{h})\, Q(\mathbf{h})}{\sum_{\mathbf{h}'} P_{\theta_I}(\mathbf{x}|\mathbf{h}')\, Q(\mathbf{h}')} \qquad (32)$$

We can turn $Q^{\mathrm{cond}}(\mathbf{h}|\mathbf{x})$ into a new distribution on $\mathbf{h}$ by using the data
distribution:

$$Q^{\mathrm{cond}}_{\mathcal{D}}(\mathbf{h}) := \sum_{\mathbf{x}} Q^{\mathrm{cond}}(\mathbf{h}|\mathbf{x})\, P_{\mathcal{D}}(\mathbf{x}) \qquad (33)$$

and in general $Q^{\mathrm{cond}}_{\mathcal{D}}(\mathbf{h})$ will not coincide with the original distribution $Q(\mathbf{h})$,
if only because the definition of the former involves the data whereas $Q$
is arbitrary. We will show that this operation is always an improvement:
$Q^{\mathrm{cond}}_{\mathcal{D}}(\mathbf{h})$ always yields a better data log-likelihood than $Q$.

Proposition 7. Let data incorporation be the map sending a distribution
$Q(\mathbf{h})$ to $Q^{\mathrm{cond}}_{\mathcal{D}}(\mathbf{h})$ defined by (32) and (33), where $\theta_I$ is fixed. It has the
following properties:

• Data incorporation always increases the data log-likelihood (31).

• The best latent marginal $\hat{Q}_{\theta_I,\mathcal{D}}$ is a fixed point of this transformation.
More precisely, the distributions $Q$ that are fixed points of data
incorporation are exactly the critical points of the data log-likelihood (31) (by
concavity of (31) these critical points are all maxima with the same
value). In particular if the BLM is uniquely defined (the arg max in
(13) is unique), then it is the only fixed point of data incorporation.

• Data incorporation $Q \mapsto Q^{\mathrm{cond}}_{\mathcal{D}}$ coincides with one step of the
expectation-maximization (EM) algorithm to maximize data log-likelihood by
optimizing over $Q$ for a fixed $\theta_I$, with $\mathbf{h}$ as the hidden variable.

This can be seen as a justification for constructing the hidden variable
model $Q$ through an inference model $q(\mathbf{h}|\mathbf{x})$ from the data, which is the basic
approach of auto-encoders and the BLM.

Proof. Let us first prove the statement about expectation-maximization.
Since the EM algorithm is known to increase data log-likelihood at each step
[10, 25], this will prove the first statement as well.

For simplicity let us assume that the data distribution is uniform over
the dataset $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$. (Arbitrary data weights can be approximated
by putting the same observation several times into the data.) The hidden
variable of the EM algorithm will be $\mathbf{h}$, and the parameter over which the EM
optimizes will be the distribution $Q(\mathbf{h})$ itself. In particular we keep $\theta_I$ fixed.
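The distributions $Q$ and $P_{\theta_I}$ define a distribution $P(\mathbf{x}, \mathbf{h}) := P_{\theta_I}(\mathbf{x}|\mathbf{h})\,Q(\mathbf{h})$
over pairs $(\mathbf{x}, \mathbf{h})$. This extends to a distribution over n-tuples of observations:

$$P\big((\mathbf{x}_1, \mathbf{h}_1), \ldots, (\mathbf{x}_n, \mathbf{h}_n)\big) = \prod_i P_{\theta_I}(\mathbf{x}_i|\mathbf{h}_i)\, Q(\mathbf{h}_i)$$

and by summing over the states of the hidden variables

$$P(\mathbf{x}_1, \ldots, \mathbf{x}_n) = \sum_{(\mathbf{h}_1, \ldots, \mathbf{h}_n)} P\big((\mathbf{x}_1, \mathbf{h}_1), \ldots, (\mathbf{x}_n, \mathbf{h}_n)\big)$$

Denote $\vec{\mathbf{x}} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and $\vec{\mathbf{h}} = (\mathbf{h}_1, \ldots, \mathbf{h}_n)$. One step of the EM
algorithm operating with the distribution $Q$ as parameter, is defined as
transforming the current distribution $Q_t$ into the new distribution

$$Q_{t+1} = \arg\max_{Q} \sum_{\vec{\mathbf{h}}} P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) \log P(\vec{\mathbf{x}}, \vec{\mathbf{h}})$$

where $P_t(\mathbf{x}, \mathbf{h}) = P_{\theta_I}(\mathbf{x}|\mathbf{h})\,Q_t(\mathbf{h})$ is the distribution obtained by using $Q_t$ for $\mathbf{h}$,
and $P$ the one obtained from the distribution $Q$ over which we optimize. Let
us follow a standard argument for EM algorithms on n-tuples of independent
observations:

$$\sum_{\vec{\mathbf{h}}} P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) \log P(\vec{\mathbf{x}}, \vec{\mathbf{h}}) = \sum_{\vec{\mathbf{h}}} P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) \log \prod_i P(\mathbf{x}_i, \mathbf{h}_i)$$
$$= \sum_i \sum_{\vec{\mathbf{h}}} P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) \log P(\mathbf{x}_i, \mathbf{h}_i)$$

Since observations are independent, $P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}})$ decomposes as a product and so

$$\sum_i \sum_{\vec{\mathbf{h}}} \big(\log P(\mathbf{x}_i, \mathbf{h}_i)\big) P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) = \sum_i \sum_{\mathbf{h}_1, \ldots, \mathbf{h}_n} \big(\log P(\mathbf{x}_i, \mathbf{h}_i)\big) \prod_j P_t(\mathbf{h}_j|\mathbf{x}_j)$$
$$= \sum_i \sum_{\mathbf{h}_i} \big(\log P(\mathbf{x}_i, \mathbf{h}_i)\big) P_t(\mathbf{h}_i|\mathbf{x}_i) \prod_{j \neq i} \sum_{\mathbf{h}_j} P_t(\mathbf{h}_j|\mathbf{x}_j)$$

but of course $\sum_{\mathbf{h}_j} P_t(\mathbf{h}_j|\mathbf{x}_j) = 1$ so that finally

$$\sum_{\vec{\mathbf{h}}} P_t(\vec{\mathbf{h}}|\vec{\mathbf{x}}) \log P(\vec{\mathbf{x}}, \vec{\mathbf{h}}) = \sum_i \sum_{\mathbf{h}_i} \big(\log P(\mathbf{x}_i, \mathbf{h}_i)\big) P_t(\mathbf{h}_i|\mathbf{x}_i)$$
$$= \sum_{\mathbf{h}} \sum_i \big(\log P(\mathbf{x}_i, \mathbf{h})\big) P_t(\mathbf{h}|\mathbf{x}_i)$$
$$= \sum_{\mathbf{h}} \sum_i \big(\log P_{\theta_I}(\mathbf{x}_i|\mathbf{h}) + \log Q(\mathbf{h})\big) P_t(\mathbf{h}|\mathbf{x}_i)$$

because $P(\mathbf{x}, \mathbf{h}) = P_{\theta_I}(\mathbf{x}|\mathbf{h})\,Q(\mathbf{h})$. We have to maximize this quantity over
$Q$. The first term does not depend on $Q$ so we only have to maximize

$$\sum_{\mathbf{h}} \sum_i \big(\log Q(\mathbf{h})\big) P_t(\mathbf{h}|\mathbf{x}_i).$$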

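This latter quantity is concave in $Q$, so to find the maximum it is sufficient
to exhibit a point where the derivative w.r.t. $Q$ (subject to the constraint
that $Q$ is a probability distribution) vanishes.

Let us compute this derivative. If we replace $Q$ with $Q + \delta Q$ where $\delta Q$ is
infinitesimal, the variation of the quantity to be maximized is

$$\sum_{\mathbf{h}} \sum_i \big(\delta \log Q(\mathbf{h})\big) P_t(\mathbf{h}|\mathbf{x}_i) = \sum_{\mathbf{h}} \frac{\delta Q(\mathbf{h})}{Q(\mathbf{h})} \sum_i P_t(\mathbf{h}|\mathbf{x}_i)$$

Let us take $Q = (Q_t)^{\mathrm{cond}}_{\mathcal{D}}$. Since we assumed for simplicity that the data
distribution $P_{\mathcal{D}}$ is uniform over the sample $\mathcal{D}$, this is

$$Q(\mathbf{h}) = (Q_t)^{\mathrm{cond}}_{\mathcal{D}}(\mathbf{h}) = \frac{1}{n} \sum_i P_t(\mathbf{h}|\mathbf{x}_i)$$

so that the variation of the quantity to be maximized is

$$\sum_{\mathbf{h}} \frac{\delta Q(\mathbf{h})}{Q(\mathbf{h})} \sum_i P_t(\mathbf{h}|\mathbf{x}_i) = n \sum_{\mathbf{h}} \delta Q(\mathbf{h})$$

But since $Q$ and $Q + \delta Q$ are both probability distributions, both sum to
1 over $\mathbf{h}$ so that $\sum_{\mathbf{h}} \delta Q(\mathbf{h}) = 0$. This proves that this choice of $Q$ is an
extremum of the quantity to be maximized.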

This proves the last statement of the proposition. As mentioned above, it
implies the first by the general properties of EM. Once the first statement
is proven, the best latent marginal $\hat{Q}_{\theta_I,\mathcal{D}}$ has to be a fixed point of data
incorporation, because otherwise we would get an even better distribution
thus contradicting the definition of the BLM.

The only point left to prove is the equivalence between critical points of
the log-likelihood and fixed points of $Q \mapsto Q^{\mathrm{cond}}_{\mathcal{D}}$. This is a simple instance
of maximization under constraints, as follows. Critical points of the data
log-likelihood are those for which the log-likelihood does not change at first order
when $Q$ is replaced with $Q + \delta Q$ for small $\delta Q$. The only constraint on $\delta Q$ is
that $Q + \delta Q$ must still be a probability distribution, so that $\sum_{\mathbf{h}} \delta Q(\mathbf{h}) = 0$
because both $Q$ and $Q + \delta Q$ sum to 1.
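The first-order variation of log-likelihood is

$$\delta \sum_i \log P(\mathbf{x}_i) = \delta \sum_i \log \Big( \sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}_i|\mathbf{h})\, Q(\mathbf{h}) \Big)$$
$$= \sum_i \frac{\sum_{\mathbf{h}} P_{\theta_I}(\mathbf{x}_i|\mathbf{h})\, \delta Q(\mathbf{h})}{\sum_{\mathbf{h}'} P_{\theta_I}(\mathbf{x}_i|\mathbf{h}')\, Q(\mathbf{h}')}$$
$$= \sum_{\mathbf{h}} \sum_i \frac{P(\mathbf{h}|\mathbf{x}_i)}{Q(\mathbf{h})}\, \delta Q(\mathbf{h})$$

where the last equality uses $P_{\theta_I}(\mathbf{x}_i|\mathbf{h}) / P(\mathbf{x}_i) = P(\mathbf{h}|\mathbf{x}_i) / Q(\mathbf{h})$.
This must vanish for any $\delta Q$ such that $\sum_{\mathbf{h}} \delta Q(\mathbf{h}) = 0$. By elementary linear
algebra (or Lagrange multipliers) this occurs if and only if $\sum_i P(\mathbf{h}|\mathbf{x}_i) / Q(\mathbf{h})$ does
not depend on $\mathbf{h}$, i.e., if and only if $Q$ satisfies $Q(\mathbf{h}) = C \sum_i P(\mathbf{h}|\mathbf{x}_i)$. Since
$Q$ sums to 1 one finds $C = \frac{1}{n}$. Since all along $P$ is the probability distribution
on $\mathbf{x}$ and $\mathbf{h}$ defined by $Q$ and $P_{\theta_I}(\mathbf{x}|\mathbf{h})$, namely, $P(\mathbf{x}, \mathbf{h}) = P_{\theta_I}(\mathbf{x}|\mathbf{h})\,Q(\mathbf{h})$,
by definition we have $P(\mathbf{h}|\mathbf{x}) = Q^{\mathrm{cond}}(\mathbf{h}|\mathbf{x})$ so that the condition
$Q(\mathbf{h}) = \frac{1}{n} \sum_i P(\mathbf{h}|\mathbf{x}_i)$ exactly means that $Q = Q^{\mathrm{cond}}_{\mathcal{D}}$, hence the equivalence between
critical points of log-likelihood and fixed points of data incorporation. □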
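
Proposition 7 can be illustrated numerically. The sketch below applies data incorporation, i.e. (32) followed by (33), repeatedly to a random starting distribution $Q$ on a small discrete model with a uniform data distribution, and records the data log-likelihood (31) after each step. The model sizes, the toy dataset and all function names are assumptions made for this illustration only.

```python
import math
import random


def random_dist(n, rng):
    """Return a random probability vector of length n."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(p_gen, Q, data):
    """Average over the dataset of log sum_h P(x|h) Q(h), as in (31)."""
    return sum(math.log(sum(p_gen[h][x] * Q[h] for h in range(len(Q))))
               for x in data) / len(data)


def data_incorporation(p_gen, Q, data):
    """One step Q -> Q^cond_D for a uniform distribution over `data`."""
    n_h = len(Q)
    new_Q = [0.0] * n_h
    for x in data:
        joint = [p_gen[h][x] * Q[h] for h in range(n_h)]
        z = sum(joint)
        for h in range(n_h):
            # (32): posterior of h given x; (33): average over the data.
            new_Q[h] += joint[h] / z / len(data)
    return new_Q


rng = random.Random(0)
n_x, n_h = 5, 3
p_gen = [random_dist(n_x, rng) for _ in range(n_h)]  # p_gen[h][x] = P(x|h)
data = [0, 0, 1, 3, 4, 4, 4]                         # observed values of x
Q = random_dist(n_h, rng)

history = [log_likelihood(p_gen, Q, data)]
for _ in range(20):
    Q = data_incorporation(p_gen, Q, data)
    history.append(log_likelihood(p_gen, Q, data))
```

Consistently with the EM interpretation, the values in `history` never decrease, and `Q` remains a probability distribution throughout.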



3 Applications and Experiments 

Given the approach described above, we now consider several applications 
for which we evaluate the method empirically. 

The intractability of the log-likelihood for deep networks makes direct 
comparison of several methods difficult in general. Often the evaluation is 
done by using latent variables as features for a classification task and by 
direct visual comparison of samples generated by the model [14, 2J]. Instead, 
we introduce two new datasets which are simple enough for the true log- 
likelihood to be computed explicitly, yet complex enough to be relevant to 
deep learning. 

We first check that these two datasets are indeed deep. 

Then we try to assess the impact of the various approximations from 
theory to practice, on the validity of the approach. 

We then apply our method to the training of deep belief networks using 
properly modified auto-encoders, and show that the method outperforms 
current state of the art. 
We also explore the use of the BLM upper bound to perform layer-wise
hyper-parameter selection and show that it gives an accurate prediction of
the future log-likelihood of models.

3.1 Low-Dimensional Deep Datasets 

We now introduce two new deep datasets of low dimension. In order for those 
datasets to give a reasonable picture of what happens in the general case, 
we first have to make sure that they are relevant to deep learning, using the 
following approach: 

1. In the spirit of [(>], we train 1000 RBMs using CD-1 [ ] on the dataset
D, and evaluate the log-likelihood of a disjoint validation dataset V
under each model.

2. We train 1000 2-layer deep networks using stacked RBMs trained with
CD-1 on D, and evaluate the log-likelihood of V under each model.

3. We compare the performance of each model at equal number of param- 
eters. 

4. If deep networks consistently outperform single RBMs for the same 
number of parameters, the dataset is considered to be deep. 

The comparison at equal number of parameters is justified by one of the main 
hypotheses of deep learning, namely that deep architectures are capable of 
representing some functions more compactly than shallow architectures [2]. 

Hyper-parameters taken into account for hyper-parameter random search 
are the hidden layers sizes, CD learning rate and number of CD epochs. The 
corresponding priors are given in Table 1. In order not to give an obvious 
head start to deep networks, the possible layer sizes are chosen so that the 
maximum number of parameters for the single RBM and the deep network 
are as close as possible. 
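
The balance between the two search spaces can be checked with a short parameter count. The sketch below assumes that parameters are counted as weights plus biases, and that the second RBM of a stacked model reuses the first hidden layer's bias; this counting convention is an assumption, not something stated in the text. The layer-size limits (19 for the single RBM, 16 and 16 for the deep network, 100 visible units) are those of Table 1 and of the datasets below.

```python
def rbm_params(n_visible, n_hidden):
    """Weights plus visible and hidden biases of a single RBM."""
    return n_visible * n_hidden + n_visible + n_hidden


def stacked_rbm_params(n_visible, n_hidden1, n_hidden2):
    """Two-layer stacked RBM: the second RBM adds its weights and its own
    hidden biases (assumed convention: the h1 bias is shared)."""
    return (rbm_params(n_visible, n_hidden1)
            + n_hidden1 * n_hidden2 + n_hidden2)


largest_rbm = rbm_params(100, 19)               # 2019 parameters
largest_deep = stacked_rbm_params(100, 16, 16)  # 1988 parameters
```

Under this convention the two maxima differ by 31 parameters, whereas allowing 17 units in both hidden layers of the deep network would give 2123 parameters and overshoot the single RBM.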

Cmnist dataset 

The Cmnist dataset is a low-dimensional variation on the Mnist dataset 
[17], containing 12,000 samples of dimension 100. The full dataset is split 
into training, validation and test sets of 4,000 samples each. The dataset is 
obtained by taking a 10 x 10 image at the center of each Mnist sample and 
using the values in [0,1] as probabilities. The first 10 samples of the Cmnist 
dataset are shown in Figure 2. 
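
A minimal sketch of this construction is given below. It assumes that each Mnist image is available as a 28 x 28 nested list of gray levels already scaled to [0,1], and it interprets "using the values as probabilities" as drawing each binary pixel from a Bernoulli distribution; both the input format and the function names are assumptions for illustration.

```python
import random


def center_crop(image, size=10):
    """Return the size x size patch at the center of a 2-D nested list."""
    rows, cols = len(image), len(image[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def sample_binary(probabilities, rng):
    """Draw each pixel as a Bernoulli variable with the given probability."""
    return [[1 if rng.random() < p else 0 for p in row]
            for row in probabilities]


# Synthetic stand-in for an Mnist image: a bright 10 x 10 square in the center.
image = [[1.0 if 9 <= r < 19 and 9 <= c < 19 else 0.0 for c in range(28)]
         for r in range(28)]
patch = center_crop(image)
binary = sample_binary(patch, random.Random(0))
```

For a 28 x 28 input the crop covers rows and columns 9 to 18, so the synthetic image above yields a patch of ones.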

We propose two baselines to which to compare the log-likelihood values
of models trained on the Cmnist dataset:

1. The uniform coding scheme: a model which gives equal probability to
all possible binary 10 x 10 images. The log-likelihood of each sample is
then -100 bits, or -69.31 nats.

2. The independent Bernoulli model in which each pixel is given an
independent Bernoulli probability. The model is trained on the training
set. The log-likelihood of the validation set is -67.38 nats per sample.

The comparison of the log-likelihood of stacked RBMs with that of single
RBMs is presented in Figure 3 and confirms that the Cmnist dataset is deep.

Parameter                       Prior
RBM hidden layer size           1 to 19
Deep Net hidden layer 1 size    1 to 16
Deep Net hidden layer 2 size    1 to 16
inference hidden layer size     1 to 500
CD learn rate                   log U(10^-?, 5 x 10^-?)
BP learn rate                   log U(10^-?, 5 x 10^-?)
CD epochs                       20 x (10000/N)
BP epochs                       20 x (10000/N)
ANN init σ                      U(0,1)

Table 1: Search space for hyper-parameters when using random search for a
dataset of size N.

Figure 2: First 10 samples of the Cmnist dataset.

Tea dataset 

The Tea dataset is based on the idea of learning an invariance for the amount 
of liquid in several containers: a teapot and 5 teacups. It contains 243 distinct 
samples which are then distributed into a training, validation and test set 
of 81 samples each. The dataset consists of 10 x 10 images in which the left 
part of the image represents a (stylized) teapot of size 10 x 5. The right part 
of the image represents 5 teacups of size 2x5. The liquid is represented by 
ones and always lies at the bottom of each container. The total amount of 
liquid is always equal to the capacity of the teapot, i.e., there are always 
50 ones and 50 zeros in any given sample. The first 10 samples of the Tea 
dataset are shown in Figure 4. 
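
The description above is enough to enumerate the dataset under one plausible layout. The sketch below assumes that the teapot occupies the left five columns (10 rows by 5 columns), that the five teacups are stacked vertically in the right five columns (2 rows by 5 columns each), and that each cup is filled by 0, 1 or 2 full rows; the exact pixel layout used by the authors is not specified here, so this is an assumption that merely reproduces the stated counts.

```python
import itertools


def tea_sample(cup_levels):
    """Build one 10 x 10 binary image from the fill level (0, 1 or 2 rows)
    of each of the 5 cups; the teapot holds the remaining liquid."""
    image = [[0] * 10 for _ in range(10)]
    pot_level = 10 - sum(cup_levels)          # rows of liquid left in the pot
    for r in range(10 - pot_level, 10):       # liquid lies at the bottom
        for c in range(5):
            image[r][c] = 1
    for k, level in enumerate(cup_levels):
        bottom = 2 * k + 1                    # cup k occupies rows 2k and 2k+1
        for r in range(bottom - level + 1, bottom + 1):
            for c in range(5, 10):
                image[r][c] = 1
    return image


samples = [tea_sample(levels)
           for levels in itertools.product(range(3), repeat=5)]
```

With three possible levels for each of five cups, this enumeration yields 3^5 = 243 distinct images, each containing 50 ones, in agreement with the figures quoted above.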

In order to better interpret the log-likelihood of models trained on the
Tea dataset, we propose 3 baselines:

1. The uniform coding scheme: the baseline is the same as for the Cmnist
dataset: -69.31 nats.

2. The independent Bernoulli model, adjusted on the training set. The
log-likelihood of the validation set is -49.27 nats per sample.

3. The perfect model in which all 243 samples of the full dataset (constituted
by concatenation of the training, validation and test sets) are given
the probability 1/243. The expected log-likelihood of a sample from the
validation dataset is then log(1/243) = -5.49 nats.

The comparison of the log-likelihood of stacked RBMs and single RBMs is
presented in Figure 5 and confirms that the Tea dataset is deep.

Figure 3: Checking that Cmnist is deep: log-likelihood of the validation
dataset V under RBMs and SRBM deep networks selected by hyper-parameter
random search, as a function of the number of parameters dim(θ).

Figure 4: First 10 samples of the Tea dataset.

Figure 5: Checking that Tea is deep: log-likelihood of the validation dataset V
under RBMs and SRBM deep networks selected by hyper-parameter random
search, as a function of the number of parameters dim(θ).
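
The closed-form baselines quoted for the two datasets can be reproduced in a few lines, and the independent Bernoulli baseline can be sketched as well. The clipping constant `eps`, which keeps the logarithm finite when a pixel is constant on the training set, is an assumption of this sketch; the text does not say how that case was handled.

```python
import math


def uniform_baseline(n_pixels=100):
    """Log-likelihood (nats) under the uniform model on binary images."""
    return -n_pixels * math.log(2)


def perfect_tea_baseline(n_samples=243):
    """Log-likelihood (nats) when each distinct sample has probability 1/n."""
    return math.log(1.0 / n_samples)


def fit_bernoulli(train, eps=1e-6):
    """Per-pixel frequency of ones, clipped to [eps, 1 - eps]."""
    n, d = len(train), len(train[0])
    return [min(max(sum(x[i] for x in train) / n, eps), 1.0 - eps)
            for i in range(d)]


def bernoulli_log_likelihood(probs, data):
    """Average log-likelihood per sample under independent Bernoulli pixels."""
    total = 0.0
    for x in data:
        total += sum(math.log(p) if v else math.log(1.0 - p)
                     for v, p in zip(x, probs))
    return total / len(data)


uniform = uniform_baseline()        # about -69.31 nats
perfect = perfect_tea_baseline()    # about -5.49 nats
toy_probs = fit_bernoulli([[1, 0], [1, 1]])
toy_ll = bernoulli_log_likelihood(toy_probs, [[1, 0]])
```

The Bernoulli figures reported in the text (-67.38 and -49.27 nats) depend on the actual datasets and are not reproduced by the toy example.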

3.2 Deep Generative Auto-Encoder Training 

A first application of our approach is the training of a deep generative model 
using auto-associators. To this end, we propose to train lower layers using 
auto-associators and to use an RBM for the generative top layer model. 

We will compare three kinds of deep architectures: standard auto-encoders 
with an RBM on top (vanilla AEs), the new auto-encoders with rich inference 
(AERIes) suggested by our framework, also with an RBM on top, and, for 
comparison, stacked restricted Boltzmann machines (SRBMs). All the models 
used in this study use the same final generative model class for P(x|h) so 
that the comparison focuses on the training procedure, on equal ground. 
SRBMs are considered the state of the art [12, 1] — although performance can 
be increased using richer models [•]], our focus here is not on the model but 
on the layer-wise training procedure for a given model class. 

In ideal circumstances, we would have compared the log-likelihood ob- 
tained for each training algorithm with the optimum of a deep learning 
procedure such as the full gradient ascent procedure (Section 2). Instead, 
since this ideal deep learning procedure is intractable, SRBMs serve as a 
reference. 

The new AERIes are auto-encoders modified after the following remark:
the complexity of the inference model used for $q(\mathbf{h}|\mathbf{x})$ can be increased safely
without risking overfitting and loss of generalization power, because $q$ is not part
of the final generative model, and is used only as a tool for optimization of the
generative model parameters. This would suggest that the complexity of $q$
could be greatly increased with only positive consequences on the performance
of the model.

AERIes exploit this possibility by having, in each layer, a modified
auto-associator with two hidden layers instead of one: $\mathbf{x} \to \mathbf{h}' \to \mathbf{h} \to \mathbf{x}$. The
generative part $P_{\theta_I}(\mathbf{x}|\mathbf{h})$ will be equivalent to that of a regular auto-associator,
but the inference part $q(\mathbf{h}|\mathbf{x})$ will have greater representational power because
it includes the hidden layer $\mathbf{h}'$ (see Figure 7).

We will also use the more usual auto-encoders composed of auto-associators 
with one hidden layer and tied weights, commonly encountered in the litera- 
ture (vanilla AE). 
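
The difference between the two architectures can be sketched as a forward pass. The code below is an illustration only: it uses sigmoid units, untied and randomly initialized weights, the mean activation of h rather than a sample, and arbitrary layer sizes, all of which are assumptions of the sketch. It covers the forward computation and the cross-entropy loss, not the training loop.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dense(inputs, weights, biases):
    """Sigmoid layer; weights[j][i] connects input i to output j."""
    return [sigmoid(b + sum(w * v for w, v in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng, sigma=0.1):
    """Gaussian weights with standard deviation sigma, zero biases."""
    weights = [[rng.gauss(0.0, sigma) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, inference_layers, generative_layer):
    """Inference x -> ... -> h through every inference layer, then the
    one-layer generative map h -> x_hat."""
    h = x
    for weights, biases in inference_layers:
        h = dense(h, weights, biases)
    weights, biases = generative_layer
    return h, dense(h, weights, biases)


def cross_entropy(x, x_hat):
    """Negative log-probability of binary x under Bernoulli outputs x_hat."""
    return -sum(v * math.log(p) + (1 - v) * math.log(1.0 - p)
                for v, p in zip(x, x_hat))


rng = random.Random(0)
x = [1 if i % 2 else 0 for i in range(100)]
# AERI: two inference layers (x -> h' -> h), one generative layer (h -> x).
aeri_inference = [init_layer(100, 50, rng), init_layer(50, 16, rng)]
generative = init_layer(16, 100, rng)
h, x_hat = forward(x, aeri_inference, generative)
loss = cross_entropy(x, x_hat)
```

A vanilla auto-encoder corresponds to passing a single inference layer, e.g. `[init_layer(100, 16, rng)]`; the generative layer, and hence the number of generative parameters, is the same in both cases.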

For all models, the deep architecture will be of depth 2. The stacked
RBMs will be made of two ordinary RBMs. For AERIes and vanilla AEs, the
lower part is made of a single auto-associator (modified for AERIes), and the
generative top part is an RBM. (Thus they have one layer less than depicted
for the sake of generality in Figures 6 and 7.) For AERIes and vanilla AEs the
lower part of the model is trained using the usual backpropagation algorithm
with cross-entropy loss, which performs gradient ascent for the probability of
(25). The top RBM is then trained to maximize (12).

The competitiveness of each model will be evaluated through a compar- 
ison in log-likelihood over a validation set distinct from the training set. 
Comparisons are made for a given identical number of parameters of the 
generative model^. Each model will be given equal chance to find a good 
optimum in terms of the number of evaluations in a hyper-parameter selection 
procedure by random search. 

When implementing the training procedure proposed in Section 2, several
approximations are needed. An important one, compared to Theorem 1,
is that the distribution $q(\mathbf{h}|\mathbf{x})$ will not really be trained over all possible
conditional distributions for $\mathbf{h}$ knowing $\mathbf{x}$. Next, training of the upper layers
will of course fail to reproduce the BLM perfectly. Moreover, auto-associators
use an $\mathbf{x} = \tilde{\mathbf{x}}$ approximation, cf. (25). We will study the effect of these
approximations.

Let us now provide more details for each model. 

Stacked RBMs. For our comparisons, 1000 stacked RBMs were trained us- 
ing the procedure from [ ^ ]. We used random search on the hyper-parameters, 
which are: the sizes of the hidden layers, the CD learning rate, and the number 
of CD epochs. 

^Because we only consider the generative models obtained, q is never taken into account 
in the number of parameters of an auto-encoder or SRBM. However, the parameters of the 
top RBM are taken into account as they are a necessary part of the generative model. 
Figure 6: Deep generative auto-encoder training scheme (panels: train 1st
layer AA, train 2nd layer AA).



Vanilla auto-encoders. The general training algorithm for vanilla
auto-encoders is depicted in Figure 6. First an auto-associator is trained to
maximize the adaptation of the BLM upper bound for auto-associators
presented in (25). The maximization procedure itself is done with the
backpropagation algorithm and cross-entropy loss. The inference weights are
tied to the generative weights so that $W_{\mathrm{gen}} = W_{\mathrm{inf}}^{\top}$ as is often the case in
practice. An ordinary RBM is used as a generative model on the top layer.

1000 deep generative auto-encoders were trained using random search on
the hyper-parameters. Because deep generative auto-encoders use an RBM
as the top layer, they use the same hyper-parameters as stacked RBMs, but
also backpropagation (BP) learning rate, BP epochs, and ANN init σ (i.e.
the standard deviation of the Gaussian used during initialization).

Auto-Encoders with Rich Inference (AERIes). The model and training
scheme for AERIes are represented in Figure 7. Just as for vanilla
auto-encoders, we use the backpropagation algorithm and cross-entropy loss
to maximize the auto-encoder version (25) of the BLM upper bound on the
training set. No weights are tied, of course, as this does not make sense for
an auto-associator with different models for $P(\mathbf{x}|\mathbf{h})$ and $q(\mathbf{h}|\mathbf{x})$. The top
RBM is trained afterwards. Hyper-parameters are the same as above, with
in addition the size of the new hidden layer $\mathbf{h}'$.

Figure 7: Deep generative modified auto-encoder (AERI) training scheme.

Results 

The results of the above comparisons on the Tea and Cmnist validation
datasets are given in Figures 8 and 9. For better readability, the Pareto
front^ for each model is given in Figures 10 and 11.

As expected, all models perform better than the baseline independent
Bernoulli model but have a lower likelihood than the perfect model*. Also,
SRBMs, vanilla AEs and AERIes perform better than a single RBM, which
can be seen as further evidence that the Tea and Cmnist are deep datasets.

Among deep models, vanilla auto-encoders achieve the lowest performance,
but outperform single RBMs significantly, which validates them not only as
generative models but also as deep generative models. Compared to SRBMs,
vanilla auto-encoders achieve almost identical performance but the algorithm
clearly suffers from local optima: most instances perform poorly and only a
handful achieve performance comparable to that of SRBMs or AERIes.

As for the auto-encoders with rich inference (AERIes), they are able
to outperform not only single RBMs and vanilla auto-encoders, but also
stacked RBMs, and do so consistently. This validates not only the general
deep learning procedure of Section 2, but arguably also the understanding of
auto-encoders in this framework.

The results confirm that a more universal model for $q$ can significantly
improve the performance of a model, as is clear from comparing the vanilla
and rich-inference auto-encoders. Let us insist that the rich-inference
auto-encoders and vanilla auto-encoders optimize over exactly the same set of
generative models with the same structure, and thus are facing exactly the
same optimization problem (4). Clearly the modified training procedure
yields improved values of the generative parameter θ.

^The Pareto front is composed of all models which are not subsumed by other models
according to the number of parameters and the expected log-likelihood. A model is said to
be subsumed by another if it has strictly more parameters and a worse likelihood.

*Note that some instances are outperformed by the uniform coding scheme, which may
seem surprising. Because we are considering the average log-likelihood on a validation
set, if even one sample of the validation set happens to be given a low probability by the
model, the average log-likelihood will be arbitrarily low. In fact, because of roundoff errors
in the computation of the log-likelihood, a few models have a measured performance of
-∞. This does not affect the comparison of the models as it only affects instances for
which performance is already very low.

Figure 8: Comparison of the average validation log-likelihood for SRBMs,
vanilla AE, and AERIes on the Tea dataset.

Figure 9: Comparison of the average validation log-likelihood for SRBMs,
vanilla AE, and AERIes on the Cmnist dataset.

Figure 10: Pareto fronts for the average validation log-likelihood and number
of parameters for SRBMs, deep generative auto-encoders, and modified deep
generative auto-encoders on the Tea dataset.

Figure 11: Pareto fronts for the average validation log-likelihood and number
of parameters for SRBMs, deep generative auto-encoders, and modified deep
generative auto-encoders on the Cmnist dataset.
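
The Pareto front used in Figures 10 and 11 follows directly from the definition of subsumption given in the footnote: a model is discarded when some other model has strictly fewer parameters and a strictly better log-likelihood. A minimal sketch, in which the representation of a model as a `(n_params, log_likelihood)` pair and the example values are assumptions:

```python
def pareto_front(models):
    """Keep the models that are not subsumed by any other model.

    `models` is a list of (n_params, log_likelihood) pairs."""
    def subsumed(model, other):
        # Strictly more parameters and a worse likelihood than `other`.
        return model[0] > other[0] and model[1] < other[1]

    return [m for m in models if not any(subsumed(m, o) for o in models)]


# Hypothetical results, for illustration only.
models = [(100, -50.0), (200, -40.0), (300, -45.0), (150, -60.0)]
front = pareto_front(models)
```

Here `(300, -45.0)` is subsumed by `(200, -40.0)` and `(150, -60.0)` by `(100, -50.0)`, so only the first two models remain on the front.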

3.3 Layer-Wise Evaluation of Deep Belief Networks

As seen in section 2, the BLM upper bound $U_{\mathcal{D}}(\theta_I)$ is the least upper bound
of the log-likelihood of deep generative models using some given $\theta_I$ in the
lower part of the model. This raises the question of whether it is a good
indicator of the final performance of $\theta_I$.

In this setting, there are a few approximations w.r.t. (10) and (12) that 
need to be discussed. Another point is the intractability of the BLM upper 
bound for models with many hidden variables, which leads us to propose and 
test an estimator in Section 3.3.4, though the experiments considered here 
were small enough not to need this unless otherwise specified. 

We now look, in turn, at how the BLM upper bound can be applied to 
log-likelihood estimation, and to hyper-parameter selection — which can be 
considered part of the training procedure. We first discuss various possible 
effects, before measuring them empirically. 
3.3.1 Approximations in the BLM upper bound

Consider the maximization of (14). In practice, we do not perform a specific
maximization over $q$ to obtain the BLM as in (14), but rely on the training
procedure of $\theta_I$ to maximize it. Thus the $q$ resulting from a training procedure
is generally not the globally optimal $q$ from Theorem 1. In the experiments
we of course use the BLM upper bound with the value of $q$ resulting from
the actual training.

Definition 8. For 9j and q resulting from the training of a deep generative 
model, let 



be the empirical BLM upper bound. 

This definition makes no assumption about how 9j and q in the first layer 
have been trained, and can be applied to any layer-wise training procedure, 
such as SRBMs. 
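
For a hidden layer small enough to enumerate, the empirical BLM upper bound can be evaluated exactly by summing over all hidden states. The following is a minimal sketch, assuming a toy model in which both $P_{\theta_I}(x|h)$ and $q(h|x)$ are factorized Bernoulli distributions with sigmoid activations and $P_{\mathcal{D}}$ is uniform over the rows of `X`. The function name, the parameter names (`W_gen`, `b_gen`, `W_inf`, `b_inf`) and the random data are illustrative and are not taken from the experiments.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_bernoulli(v, p):
    """Log-probability of binary vectors v under independent Bernoulli(p),
    summed over the last axis."""
    return np.sum(v * np.log(p) + (1 - v) * np.log(1 - p), axis=-1)

def empirical_blm_upper_bound(X, W_gen, b_gen, W_inf, b_inf):
    """Exact value of E_x log sum_h P(x|h) q_D(h) for a small binary hidden layer.

    Assumed toy parametrization (not from the paper): sigmoid Bernoulli
    P(x|h) and q(h|x), and P_D uniform over the rows of X.
    """
    n_hidden = W_inf.shape[1]
    # All 2^n_hidden hidden states, shape (S, n_hidden).
    H = np.array(list(itertools.product([0, 1], repeat=n_hidden)), dtype=float)
    # q(h_s | x_j) for every data point j and hidden state s: shape (N, S).
    q_probs = sigmoid(X @ W_inf + b_inf)
    log_q = log_bernoulli(H[None, :, :], q_probs[:, None, :])
    # q_D(h) = (1/N) sum_j q(h | x_j), shape (S,).
    q_D = np.exp(log_q).mean(axis=0)
    # P(x_i | h_s) for every data point i and hidden state s: shape (N, S).
    p_probs = sigmoid(H @ W_gen + b_gen)
    log_p = log_bernoulli(X[:, None, :], p_probs[None, :, :])
    # Average over the data of log sum_h P(x|h) q_D(h).
    return float(np.mean(np.log(np.exp(log_p) @ q_D)))

# Illustrative random data and parameters (6 visible units, 3 hidden units).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6)).astype(float)
W_inf, b_inf = rng.normal(size=(6, 3)), np.zeros(3)
W_gen, b_gen = rng.normal(size=(3, 6)), np.zeros(6)
bound = empirical_blm_upper_bound(X, W_gen, b_gen, W_inf, b_inf)
print(bound)
```

The cost grows as $2^{\dim h}$, which is why this direct enumeration is only usable for small hidden layers.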

Ideally, this quantity should give us an idea of the final performance of 
the deep architecture when we use $\theta_I$ on the bottom layer. But there are
several discrepancies between these BLM estimates and final performance. 

A first question is the validity of the approximation (34). The BLM upper
bound $U_{\mathcal{D}}(\theta_I)$ is obtained by maximization over all possible $q$, which is of
course intractable. The learned inference distribution $q$ used in practice is
only an approximation for two reasons: first, because the model for $q$ may
not cover all possible conditional distributions $q(h|x)$, and, second, because
the training of $q$ can be imperfect. In effect $\hat{U}_{\mathcal{D},q}(\theta_I)$ is only a lower bound of
the BLM upper bound: $\hat{U}_{\mathcal{D},q}(\theta_I) \leq U_{\mathcal{D}}(\theta_I)$.

Second, we can question the relationship between the (un-approximated) 
BLM upper bound (14) and the final log-likelihood of the model. The BLM 
bound is optimistic, and tight only when the upper part of the model manages 
to reproduce the BLM perfectly. We should check how tight it is in practical 
applications when the upper layer model for $P(h)$ is imperfect.

In addition, as for any estimate from a training set, final performance 
on validation and test sets might be different. Performance of a model on 
the validation set is generally lower than on the training set. But on the 
other hand, in our situation there is a specific regularizing effect of imperfect 
training of the top layers. Indeed the BLM refers to a universal optimization 
over all possible distributions on h and might therefore overfit more, hugging 
the training set too closely. Thus if we did manage to reproduce the BLM 
perfectly on the training set, it could well decrease performance on the 
validation set. On the other hand, training the top layers to approximate the
BLM within a model class $P_{\theta_J}$ introduces further regularization and could
well yield higher final performance on the validation set than if the exact
BLM distribution had been used.

This latter regularization effect is relevant if we are to use the BLM upper 
bound for hyper-parameter selection, a scenario in which regularization is 
expected to play an important role. 

We can therefore expect: 

1. That the ideal BLM upper bound, being by definition optimistic, can
be higher than the final likelihood when the model obtained for $P(h)$ is
not perfect.

2. That the empirical bound obtained by using a given conditional distri- 
bution q will be lower than the ideal BLM upper bound either when q 
belongs to a restricted class, or when q is poorly trained. 

3. That the ideal BLM upper bound on the training set may be either 
higher or lower than actual performance on a validation set, because of 
the regularization effect of imperfect top layer training. 

All in all, the relationship between the empirical BLM upper bound used 
in training, and the final log-likelihood on real data, results from several 
effects going in both directions. This might affect whether the empirical BLM 
upper bound can really be used to predict the future performance of a given 
bottom layer setting. 
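
These expectations can be summarized by the following relations, given here only as a sketch. The symbol $\mathcal{L}_{\mathcal{D}}(\theta_I, \theta_J)$ is shorthand introduced for this summary (it is not notation used elsewhere in the text) for the training log-likelihood of the full model with bottom parameter $\theta_I$ and top parameter $\theta_J$:

$$\mathcal{L}_{\mathcal{D}}(\theta_I, \theta_J) := \mathbb{E}_{x \sim P_{\mathcal{D}}} \Big[ \log \sum_{h} P_{\theta_I}(x|h) P_{\theta_J}(h) \Big] \leq U_{\mathcal{D}}(\theta_I), \qquad \hat{U}_{\mathcal{D},q}(\theta_I) \leq U_{\mathcal{D}}(\theta_I).$$

Both quantities lie below the ideal bound, so these inequalities alone do not order the empirical bound against the final training log-likelihood, and they say nothing about log-likelihood on a validation set.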

3.3.2 A method for single-layer evaluation and layer-wise hyper-parameter selection

In the context of deep architectures, hyper-parameter selection is a difficult 
problem. It can involve as many as 50 hyper-parameters, some of them only
relevant conditionally to others [ : , ]. To make matters worse, evaluating the 
generative performance of such models is often intractable. The evaluation 
is usually done w.r.t. classification performance as in [14, 5, 6], sometimes 
complemented by a visual comparison of samples from the model [12, 21]. 
In some rare instances, a variational approximation of the log-likelihood is 
considered [22, 21]. 

These methods only consider evaluating the models after all layers have 
been fully trained. However, since the training of deep architectures is done 
in a layer-wise fashion, with some criterion greedily maximized at each step,
it would seem reasonable to perform a layer-wise evaluation. This would have 
the advantage of reducing the size of the hyper-parameter search space from 
exponential to linear in the number of layers. 
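
As a rough illustration of this reduction, assume (hypothetically) that every layer has the same number of candidate hyper-parameter settings and that layer-wise selection keeps a single best setting per layer before moving on. The function names and the figure of 10 settings per layer are made up for the example.

```python
def joint_search_size(options_per_layer, n_layers):
    """Number of configurations when all layers are tuned jointly."""
    return options_per_layer ** n_layers

def layerwise_search_size(options_per_layer, n_layers):
    """Number of configurations when each layer is tuned in turn,
    keeping only the best setting found for the layers below."""
    return options_per_layer * n_layers

for n_layers in (1, 2, 3, 4):
    print(n_layers,
          joint_search_size(10, n_layers),
          layerwise_search_size(10, n_layers))
```

If several lower layers are kept at each step rather than one, the layer-wise count is multiplied by the number kept, but it stays linear in the number of layers.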

We propose to first evaluate the performance of the lower layer, after it has
been trained, according to the BLM upper bound (34) (or an approximation
thereof) on the validation dataset $\mathcal{D}_{\text{valid}}$. The measure of performance
obtained can then be used as part of a larger hyper-parameter selection
algorithm such as [6, 5]. This results in further optimization of (10) over the
hyper-parameter space and is therefore justified by Theorem 1.

Evaluating the top layer is less problematic: by definition, the top layer 
is always a "shallow" model for which the true likelihood becomes more 
easily tractable. For instance, although RBMs are well known to have an 
intractable partition function which prevents their evaluation, several methods 
are able to compute close approximations to the true likelihood (such as 
Annealed Importance Sampling [19, 22]). The dataset to be evaluated with
this procedure will have to be a sample of $\sum_{x} q(h|x) P_{\mathcal{D}}(x)$.
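
Such a sample can be produced by passing each data point through the inference distribution and drawing one hidden vector per point. The sketch below assumes a factorized Bernoulli $q(h|x)$ with sigmoid activations; the names `sample_hidden_dataset`, `W_inf` and `b_inf` and the random inputs are illustrative, not part of the experimental code.

```python
import numpy as np

def sample_hidden_dataset(X, W_inf, b_inf, rng):
    """Draw one h ~ q(h|x) for every row x of X.

    Since the rows of X are samples of P_D, the returned rows are samples of
    sum_x q(h|x) P_D(x). Assumes a sigmoid Bernoulli q(h|x).
    """
    probs = 1.0 / (1.0 + np.exp(-(X @ W_inf + b_inf)))
    return (rng.random(probs.shape) < probs).astype(float)

# Illustrative use with random binary data (4 visible units, 3 hidden units).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 4)).astype(float)
hidden_data = sample_hidden_dataset(X, rng.normal(size=(4, 3)), np.zeros(3), rng)
print(hidden_data)
```

The resulting array plays the role of the training or validation set for the top-layer model.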

In summary, the evaluation of a two-layer generative model can be done
in a layer-wise manner:

1. Perform hyper-parameter selection on the lower layer using $\hat{U}_{\mathcal{D},q}(\theta_I)$
as a performance measure (preferably on a validation rather than
training dataset, see below), and keep only the best possible lower
layers according to this criterion.

2. Perform hyper-parameter selection on the upper layer by evaluating the
true likelihood of validation data samples transformed by the inference
distribution, under the model of the top layer.^9

Hyper-parameter selection was not used in our experiments, where we
simply used hyper-parameter random search. (In particular, this allowed us
to check the robustness of the models, as AERIes have been found to perform
better than vanilla AEs on many more instances over hyper-parameter space.)

As mentioned earlier, in the context of representation learning the top 
layer is irrelevant because the objective is not to train a generative model 
but to get a better representation of the data. With the assumption that 
good latent variables make good representations, this suggests that the BLM 
upper bound can be used directly to select the best possible lower layers. 

3.3.3 Testing the BLM and its approximations 

We now present a series of tests to check whether the selection of lower layers 
with higher values of the BLM actually results in higher log-likelihood for 
the final deep generative models, and to assess the quantitative importance 
of each of the BLM approximations discussed earlier. 

For each training algorithm (SRBMs, RBMs, AEs, AERIes), the comparison
is done using 1000 models trained with hyper-parameters selected through
random search as before. The empirical BLM upper bound is computed using
(34) above.

^9 This could lead to a stopping criterion when training a model with arbitrarily many
layers: for the upper layer, compare the likelihood of the best upper-model with the BLM
of the best possible next layer. If the BLM of the next layer is not significantly higher
than the likelihood of the upper-model, then we do not add another layer as it would not
help to achieve better performance.

Figure 12: Comparison of the BLM upper bound on the first layer and the
final log-likelihood on the Tea training dataset, for 1000 2-layer SRBMs

Training BLM upper bound vs training log-likelihood. We first
compare the value of the empirical BLM upper bound ($\mathcal{D}_{\text{train}}$) over the
training set, with the actual log-likelihood of the trained model on the training
set. This is an evaluation of how optimistic the BLM is for a given dataset,
by checking how closely the training of the upper layers manages to match
the target BLM distribution on $h$. This is also the occasion to check the
effect of using the $q(h|x)$ resulting from actual learning, instead of the best $q$
in all possible conditional distributions.

In addition, as discussed below, this comparison can be used as a criterion 
to determine whether more layers should be added to the model. 

The results are given in Figures 12 and 13 for SRBMs, and 14 and 15 
for AERIes. We see that the empirical BLM upper bound (34) is a good 
predictor of the future log-likelihood of the full model on the training set.
This shows that the approximations w.r.t. the optimality of the top layer and 
the universality of q can be dealt with in practice. 

For AERIes, a few models with low performance have a poor estimation 
of the BLM upper bound (estimated to be lower than the actual likelihood), 
presumably because of a bad approximation in the learning of q. This will 
not affect model selection procedures as it only concerns models with very 
low performance, which are to be discarded. 

If the top part of the model were not powerful enough (e.g., if the network
is not deep enough), the BLM upper bound would be too optimistic and
thus significantly higher than the final log-likelihood of the model. To further
test this intuition we now compare the BLM upper bound of the bottom
layer with the log-likelihood obtained by a shallow architecture with only one
layer; the difference would give an indication of how much could be gained by
adding top layers. Figures 16 and 17 compare the expected log-likelihood^10
of the training set under the 1000 RBMs previously trained with the BLM
upper bound^11 for a generative model using this RBM as first layer. The
results contrast with the previous ones and confirm that final performance is
below the BLM upper bound when the model does not have enough layers.

Figure 13: Comparison of the BLM upper bound on the first layer and the
final log-likelihood on the Cmnist training dataset, for 1000 2-layer SRBMs

Figure 14: Comparison of the BLM upper bound on the first layer and the
final log-likelihood on the Tea training dataset, for 1000 2-layer AERIes

Figure 15: Comparison of the BLM upper bound on the first layer and the
final log-likelihood on the Cmnist training dataset, for 1000 2-layer AERIes

The alignment in Figures 12 and 13 can therefore be seen as a confirmation 
that the Tea and Cmnist datasets would not benefit from a third layer. 

Thus, the BLM upper bound could be used to test whether adding layers
to a model is worthwhile.
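
One way to turn this test into a decision rule is to compare the BLM upper bound of the best candidate next layer with the log-likelihood already reached by the current upper model, and to add a layer only when the gap is large enough. This is a minimal sketch: the function name and the `tolerance` threshold (in nats) are hypothetical, and choosing that threshold is left open by the text.

```python
def should_add_layer(next_layer_blm, upper_model_loglik, tolerance=0.5):
    """Return True when the BLM upper bound of the best candidate next layer
    exceeds the log-likelihood of the current best upper model by more than
    `tolerance` nats (a hypothetical threshold)."""
    return (next_layer_blm - upper_model_loglik) > tolerance

# Hypothetical values: a large gap suggests another layer could help,
# a negligible gap suggests the model is already deep enough.
print(should_add_layer(-20.0, -30.0))
print(should_add_layer(-29.9, -30.0))
```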

Training BLM upper bound vs validation log-likelihood. We now
compare the training BLM upper bound with the log-likelihood on a validation
set distinct from the training set: this tests whether the BLM obtained
during training is a good indication of the final performance of a bottom
layer parameter.

As discussed earlier, because the BLM makes an assumption where there is
no regularization, using the training BLM upper bound to predict performance
on a validation set could be too optimistic: therefore we expect the validation
log-likelihood to be somewhat lower than the training BLM upper bound.
(Although, paradoxically, this can be somewhat counterbalanced by imperfect
training of the upper layers, as mentioned above.)

^10 The log-likelihood reported in this specific experiment is in fact obtained with Annealed
Importance Sampling (AIS).

^11 The BLM upper bound value given in this particular experiment is in fact a close
approximation (see Section 3.3.4).

[Plot omitted; horizontal axis: Upper Bound.]

Figure 16: BLM on a too shallow model: comparison of the BLM upper
bound and the AIS log-likelihood of an RBM on the Tea training dataset

Figure 17: BLM on a too shallow model: Comparison of the BLM upper
bound and the AIS log-likelihood of an RBM on the Cmnist training dataset

Figure 18: Training BLM upper bound vs validation log-likelihood on the
Tea training dataset

Figure 19: Training BLM upper bound vs validation log-likelihood on the
Cmnist training dataset

The results are reported in Figures 18 and 19 and confirm that the 
training BLM is an upper bound of the validation log-likelihood. As for 
regularization, we can see that on the Cmnist dataset where there are 4000
samples, generalization is not very difficult: the optimal $P(h)$ for the training
set used by the BLM is in fact almost optimal for the validation set too. On
the Tea dataset, the picture is somewhat different: there is a gap between
the training upper bound and the validation log-likelihood. This can be
attributed to the increased importance of regularization on this dataset in 
which the training set contains only 81 samples. 

Although the training BLM upper bound therefore cannot be considered
a good predictor of the validation log-likelihood, it is still a monotonic
function of the validation log-likelihood: as such it can still be used for
comparing parameter settings and for hyper-parameter selection.
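
Whether a criterion is usable for selection depends on how well it preserves the ranking of models, not on its absolute accuracy. A simple check, sketched below, is the Spearman rank correlation between the training BLM upper bound and the validation log-likelihood across models. The numbers are hypothetical and were not measured in the experiments; the helper names are illustrative.

```python
import numpy as np

def ranks(values):
    """Rank of each entry (0 = smallest); assumes there are no ties."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(np.asarray(a)), ranks(np.asarray(b)))[0, 1])

# Hypothetical values for five models: the bound is biased upward
# but preserves the ordering of the models.
train_blm = np.array([-12.0, -15.5, -20.1, -31.0, -45.2])
valid_ll = np.array([-18.3, -22.0, -25.4, -40.9, -60.0])

rho = spearman(train_blm, valid_ll)
best_by_bound = int(np.argmax(train_blm))
best_by_valid = int(np.argmax(valid_ll))
print(rho, best_by_bound, best_by_valid)
```

In this made-up case the bound overestimates every model yet selects the same best model as the validation log-likelihood would.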

Feeding the validation dataset to the BLM. Predictivity of the BLM 
(e.g., for hyper-parameter selection) can be improved by feeding the validation 
rather than training set to the inference distribution and the BLM. 

In the cases above we examined the predictivity of the BLM obtained 
during training, on final performance on a validation dataset. We have seen 
that the training BLM is an imperfect predictor of this performance, notably 
because of lack of regularization in the BLM optimistic assumption, and 
because we use an inference distribution q maximized over the training set. 

Some of these effects can easily be predicted by feeding the validation 
set to the BLM and the inference part of the model during hyper-parameter 
selection, as follows. 

We call validation BLM upper bound the BLM upper bound obtained by
using the validation dataset instead of $\mathcal{D}$ in (34). Note that the values $q$ and $\theta_I$
are still those obtained from training on the training dataset. This parallels
the validation step for auto-encoders, in which, of course, reconstruction
performance on a validation dataset is measured by feeding this same dataset to
the network.

We now compare the validation BLM upper bound to the log-likelihood 
of the validation dataset, to see if it qualifies as a reasonable proxy. 

The results are reported in Figures 20 and 21. As predicted, the validation 
BLM upper bound is a better estimator of the validation log-likelihood 
(compare Figures 18 and 19). 

We can see that several models have a validation log-likelihood higher
than the validation BLM upper bound, which might seem paradoxical. This
is simply because the validation BLM upper bound still uses the parameters
trained on the training set and thus is not formally an upper bound.

Figure 20: Validation upper bound vs log-likelihood on the Tea validation
dataset

The better overall approximation of the validation log-likelihood seems to 
indicate that performing hyper-parameter selection with the validation BLM 
upper bound can better account for generalization and regularization. 

3.3.4 Approximating the BLM for larger models 

The experimental setting considered here was small enough to allow for an
exact computation of the various BLM bounds by summing over all possible
states of the hidden variable $h$. However the exact computation of the BLM
upper bound using $\hat{U}_{\mathcal{D},q}(\theta_I)$ as in (34) is not always possible because the
number of terms in this sum is exponential in the dimension of the hidden
layer $h$.

In this situation we can use a sampling approach. For each data sample
$x$, we can take $K$ samples from each mode of the BLM distribution $q_{\mathcal{D}}$ (one
mode for each data sample $x$) to obtain an approximation of the upper bound
in $O(K \times N^2)$ where $N$ is the size of the validation set. (Since the practitioner
can choose the size of the validation set which need not necessarily be as
large as the training or test sets, we do not consider the $N^2$ factor a major
hurdle.)

Definition 9. For $\theta_I$ and $q$ resulting from the training of a deep generative
model, let

$$\hat{\hat{U}}_{\mathcal{D},q}(\theta_I) := \mathbb{E}_{x \sim P_{\mathcal{D}}} \left[ \log \sum_{\tilde{x}} \sum_{k=1}^{K} P_{\theta_I}(x|h)\, q(h|\tilde{x})\, P_{\mathcal{D}}(\tilde{x}) \right] \qquad (35)$$

where for each $\tilde{x}$ and $k$, $h$ is sampled from $q(h|\tilde{x})$.

Figure 21: Validation upper bound vs log-likelihood on the Cmnist validation
dataset

To assess the accuracy of this approximation, we take $K = 1$ and compare
the values of $\hat{\hat{U}}_{\mathcal{D},q}(\theta_I)$ and of $\hat{U}_{\mathcal{D},q}(\theta_I)$, on the Cmnist and Tea training
datasets. The results are reported in Figures 22 and 23 for all three models
(vanilla AEs, AERIes, and SRBMs) superimposed, showing good agreement.
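
The sampling approach can be sketched as follows. This is an illustration under stated assumptions rather than the estimator used in the experiments: $P_{\theta_I}(x|h)$ and $q(h|x)$ are taken to be factorized sigmoid Bernoulli distributions, $P_{\mathcal{D}}$ is uniform over the rows of `X`, and the sum over hidden states is replaced by a plain Monte Carlo average of $P_{\theta_I}(x|h)$ over the $N \times K$ sampled hidden states, which may differ in weighting from (35) as printed. All names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sampled_blm_upper_bound(X, W_gen, b_gen, W_inf, b_inf, K, rng):
    """Monte Carlo estimate of the empirical BLM upper bound.

    Assumptions (illustrative): sigmoid Bernoulli P(x|h) and q(h|x),
    uniform P_D over the rows of X, plain average over the N*K samples.
    """
    N = X.shape[0]
    q_probs = sigmoid(X @ W_inf + b_inf)                # (N, n_hidden)
    # K samples h ~ q(h | x_j) for every data point x_j: (N*K, n_hidden).
    H = (rng.random((K,) + q_probs.shape) < q_probs).astype(float)
    H = H.reshape(N * K, -1)
    p_probs = sigmoid(H @ W_gen + b_gen)                # (N*K, n_visible)
    # log P(x_i | h) for every pair (data point, sample): (N, N*K),
    # hence a cost of order K * N^2.
    log_p = X @ np.log(p_probs).T + (1 - X) @ np.log(1 - p_probs).T
    # Numerically stable log of the average of P(x_i | h) over the samples.
    m = log_p.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_p - m).mean(axis=1))
    return float(log_mix.mean())

# Illustrative random data and parameters (6 visible units, 3 hidden units).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6)).astype(float)
W_inf, b_inf = rng.normal(size=(6, 3)), np.zeros(3)
W_gen, b_gen = rng.normal(size=(3, 6)), np.zeros(6)
estimate = sampled_blm_upper_bound(X, W_gen, b_gen, W_inf, b_inf, K=1, rng=rng)
print(estimate)
```

Unlike exact enumeration, the cost does not depend exponentially on the number of hidden units, only on $K$ and on the square of the dataset size.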



Conclusions 

The new layer-wise approach we propose to train deep generative models is 
based on an optimistic criterion, the BLM upper bound, in which we suppose 
that learning will be successful for upper layers of the model. Provided this 
optimism is justified a posteriori and a good enough model is found for the 
upper layers, the resulting deep generative model is provably close to optimal. 
When optimism is not justified, we provide an explicit bound on the loss of 
performance. 

This provides a new justification for auto-encoder training and fine-tuning, 
as the training of the lower part of a deep generative model, optimized using 
a lower bound on the BLM.

Figure 22: Approximation of the training BLM upper bound on the Tea
training dataset

Figure 23: Approximation of the training BLM upper bound on the
Cmnist training dataset

This new framework for training deep generative models highlights the
importance of using richer models when performing inference, contrary to 
current practice. This is consistent with the intuition that it is much harder 
to guess the underlying structure by looking at the data, than to derive the 
data from the hidden structure once it is known. 

This possibility is tested empirically with auto-encoders with rich inference
(AERIes) which are completed with a top-RBM to create deep generative
models: these are then able to outperform the current state of the art (stacked
RBMs) on two different deep datasets.

The BLM upper bound is also found to be a good layer-wise proxy to 
evaluate the log-likelihood of future models for a given lower layer setting, 
and as such is a relevant means of hyper-parameter selection. 

This opens new avenues of research, for instance in the design of algorithms 
to learn features in the lower part of the model, or in the possibility to consider 
feature extraction as a partial deep generative model in which the upper part 
of the model is temporarily left unspecified.